Machine learning can be loosely defined as the set of computational methods that use experience to improve performance or make accurate predictions. With the revival of neural networks and the advent of deep learning, machine learning research now focuses on developing ever more complex models, which require very large training sets and huge amounts of computational power. Understanding and analyzing the properties of these models is, at least for the moment, receiving less attention in machine learning research. To fill this gap, one possibility is to rely on one of the pillars of machine learning: statistics.
Many machine learning algorithms rely on statistical models, such as ridge and lasso regression. The statistics community has derived precise theoretical results for these models, including their asymptotic properties and the construction of confidence intervals for parameters. These results enable model interpretation, robust inference and causal statements. Such results are mostly still missing in the machine learning world, because the models used there are inherently complex.
In this workshop we bring together the research communities of statistics and machine learning to foster a discussion between the two fields and develop research synergies. The workshop takes place online for the whole day of 18 September. The workshop will feature talks by:

Prof. Giacomo De Giorgi, UNIGE, GSEM,

Prof. Francois Fleuret, UNIGE, Computer Science,

Prof. Sylvain Sardy, UNIGE, Mathematics,

Prof. Olivier Scaillet, UNIGE, GSEM,

Prof. Simon Scheidegger, UNIL, HEC,

Dr. Kostantinos Sechidis, Novartis,

Prof. Dimitri Van De Ville, EPFL, UNIGE.
In addition, a number of PhD students will present their work in talks. The detailed program can be found below. You may register for the workshop by sending an email with the subject 'Registration' to sl.ws.geneva@gmail.com (please check your spam folder if you have not received a reply). Please note that participation is free of charge, but registration is mandatory.
The organizers,
Prof. Sebastian Engelke, UNIGE, GSEM
Prof. Alexandros Kalousis, HESSO
Prof. Davide La Vecchia, UNIGE, GSEM
The Workshop's program (video link for each presenter's talk)

8:40-8:45 Organizers
8:45-8:55 Prof. Antoine Geissbuhler, Vice-Rector of the University of Geneva

9:00-10:40
 Prof. Olivier Scaillet, UNIGE, GSEM, Statistical analysis of network data: learning from small samples (video)
 Prof. Francois Fleuret, UNIGE, EPFL, Fast attention models (video)
 Mr. Alban Moor, UNIGE, GSEM, A Higher-Order Correct Fast Moving-Average Bootstrap for Dependent Data (video)
 Mrs. Frantzeska Lavda, PhD student UNIGE, HESGE, Data-dependent conditional priors for unsupervised learning of multimodal data (video)

10:40-11:15 Break

11:15-12:35
 Prof. Sylvain Sardy, UNIGE, What needles do sparse neural networks find in nonlinear haystacks (video)
 Prof. Simon Scheidegger, UNIL, HEC, Gaussian Process methods for Asset Pricing and Dynamic Portfolio Choice (video)
 Mr. Lionel Blonde, PhD student UNIGE, HESGE, Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning (video)

12:35-14:15 Lunch break

14:15-16:00
 Prof. Giacomo De Giorgi, UNIGE, GSEM, Predict Mortality from Credit Reports
 Mr. Edoardo Vignotto, PhD student UNIGE, GSEM, Towards dynamical adjustment of the full temperature distribution (video)
 Prof. Dimitri Van De Ville, UNIGE, EPFL, Community-Aware Graph Signal Processing (video 1, video 2)
 Mr. Cesare Miglioli, PhD student UNIGE, GSEM, SWAG: A Wrapper Method for Sparse Learning (video)

16:00-16:30 Break

16:30-17:45

 Mrs. Amina Mollaysa, UNIGE, HESGE, Goal-directed Generation of Discrete Structures with Conditional Generative Models (video)
 Mr. Nicola Gnecco, UNIGE, GSEM, Causal discovery in heavy-tailed models (video)
 Dr. Kostantinos Sechidis, Novartis, From feature selection to predictive biomarker discovery
17:45-17:50 Concluding remarks
Abstracts
Prof. Olivier Scaillet, Statistical analysis of network data: learning from small samples
For small-sample network (spatial panel) data, we illustrate that first-order asymptotic theory suffers from finite-sample distortions. We develop saddlepoint techniques which perform well in small samples and feature higher-order accuracy. We deploy the new higher-order asymptotic techniques for the Gaussian maximum likelihood estimator in a spatial panel data model with fixed effects, time-varying covariates, and spatially correlated errors. Our saddlepoint density and tail area approximations feature a relative error of order O(m^(-1)) for m = n(T-1), with n being the cross-sectional dimension and T the time-series dimension. The main theoretical tool is the tilted-Edgeworth technique in a non-identically distributed setting. The density approximation is always non-negative, does not need resampling, and is accurate in the tails. We compare the saddlepoint density approximation to the Edgeworth expansion and to the asymptotic approximation. An empirical application to the investment-saving relationship in OECD (Organisation for Economic Co-operation and Development) countries shows disagreement between testing results based on first-order asymptotics and those based on saddlepoint techniques.
This talk is based on the working paper "Saddlepoint approximations for spatial panel data models" by Chaonan Jiang, Davide La Vecchia, Elvezio Ronchetti, and Olivier Scaillet.
Prof. Francois Fleuret, Fast attention models
Deep neural networks based on attention mechanisms have become the standard models for natural language processing, and they also demonstrate very promising results in computer vision and other application domains. Their impressive performance, however, comes at an extremely high computational cost, due to the very structure of the attention layers.
In this talk I will give a rapid introduction to attention mechanisms and transformer architectures, and present our recent contributions that aim at reducing their computational cost.
Mr. Alban Moor, A Higher-Order Correct Fast Moving-Average Bootstrap for Dependent Data
We develop the theory of a novel fast bootstrap for dependent data. Our scheme is based on the i.i.d. resampling of the smoothed moment indicators. We characterize the class of parametric and semiparametric estimation problems for which the method is valid. We show the asymptotic refinements of the proposed procedure, proving that it is higher-order correct under mild assumptions on the time series, the estimating functions, and the smoothing kernel. We illustrate the applicability and the advantages of our procedure for M-estimators, the generalized method of moments and generalized empirical likelihood estimation. In a Monte Carlo study we consider an autoregressive conditional duration model and compare our method with other existing, routinely applied first- and higher-order correct methods. The results provide numerical evidence that the novel bootstrap yields higher-order accurate confidence intervals, while remaining computationally lighter than its higher-order competitors. A real-data example on the dynamics of the trading volume of US stocks illustrates the applicability of our method.
This is joint work with D. La Vecchia and O. Scaillet.
Mrs. Frantzeska Lavda, Data-dependent conditional priors for unsupervised learning of multimodal data
One of the major shortcomings of variational autoencoders is the inability to produce generations from the individual modalities of data originating from mixture distributions. This is primarily due to the use of a simple isotropic Gaussian as the prior for the latent code in the ancestral sampling procedure for data generation. In this paper, we propose a novel formulation of variational autoencoders, conditional prior VAE (CP-VAE), with a two-level generative process for the observed data where continuous and discrete variables are introduced in addition to the observed variables. By learning data-dependent conditional priors, the new variational objective naturally encourages a better match between the posterior and prior conditionals, and the learning of the latent categories encoding the major source of variation of the original data in an unsupervised manner. By sampling the continuous latent code from the data-dependent conditional priors, we are able to generate new samples from the individual mixture components corresponding to the multimodal structure of the original data. Moreover, we unify and analyse our objective under different independence assumptions for the joint distribution of the continuous and discrete latent variables. We provide an empirical evaluation on one synthetic dataset and three image datasets, Fashion-MNIST, MNIST, and Omniglot, illustrating the generative performance of our new model compared to multiple baselines.
This is joint work with Magda Gregorova and Alexandros Kalousis.
Prof. Sylvain Sardy, What needles do sparse neural networks find in nonlinear haystacks
Inspired by LASSO, we regularize the estimation of an artificial neural network (ANN) with an L1 penalty. This has the advantage of performing variable selection (e.g., gene selection or feature selection in an image) and avoiding overfitting. The regularization parameter is selected with the Quantile Universal Threshold method. This method requires no estimation of the noise variance and no expensive cross-validation, and it can retrieve the needles and the sparsity structure of the ANN in certain regimes.
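As a toy illustration of why an L1 penalty performs variable selection, the sketch below applies the soft-thresholding operator, which is the closed-form lasso solution when the design matrix is orthonormal. It is not the speaker's method: there is no neural network here, the threshold `lam` is picked by hand rather than by the Quantile Universal Threshold, and the function names and coefficient values are hypothetical.

```python
from typing import List, Sequence


def soft_threshold(z: float, lam: float) -> float:
    """Shrink z toward zero by lam; return exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_orthonormal(ols_coefs: Sequence[float], lam: float) -> List[float]:
    """Lasso solution for an orthonormal design (assumption of this sketch).

    Minimizing 0.5 * ||y - Xw||^2 + lam * ||w||_1 with X'X = I reduces to
    soft-thresholding each least-squares coefficient separately.
    """
    return [soft_threshold(z, lam) for z in ols_coefs]


def selected(coefs: Sequence[float]) -> List[int]:
    """Indices of the coefficients that survive the penalty (the 'needles')."""
    return [i for i, w in enumerate(coefs) if w != 0.0]


# Hypothetical least-squares coefficients: two strong signals, two near zero.
example = lasso_orthonormal([3.0, -0.4, 0.1, -2.5], lam=0.5)
print(example)            # [2.5, 0.0, 0.0, -2.0]
print(selected(example))  # [0, 3]
```

The two small coefficients are set exactly to zero rather than merely shrunk, which is the selection effect the abstract refers to; the choice of the threshold is the part the talk addresses.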
Prof. Simon Scheidegger, Gaussian Process methods for Asset Pricing and Dynamic Portfolio Choice
In this paper, we consider the portfolio optimization problem for a multi-period investor who seeks to maximize her utility facing multiple risky assets and proportional transaction costs in the presence of return predictability. Due to the curse of dimensionality, this problem is challenging to solve, even numerically. To this end, we propose to embed Gaussian Process regression, in combination with the active subspace method and Bayesian active learning, inside a parallelized dynamic programming algorithm. Preliminary results will show that with this generic setup, we push the boundary of the current state of the art in the literature along several dimensions. This combination of tools allows us to study important open problems in this literature, including (i) the characterization of no-trade regions (potentially volume and welfare implications in economies with several assets), and (ii) the optimal portfolio behavior in economies with a stochastic opportunity set or stochastic frictions.
Joint work with Fabio Trojani.
Mr. Lionel Blonde, Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning
Despite the recent success of reinforcement learning in various domains, these approaches remain, for the most part, deterringly sensitive to hyperparameters and are often riddled with essential engineering feats allowing their success. We consider the case of off-policy generative adversarial imitation learning, and perform an in-depth review, qualitative and quantitative, of the method. Crucially, we show that forcing the learned reward function to be locally Lipschitz-continuous is a sine qua non condition for the method to perform well. We then study the effects of this necessary condition and provide several theoretical results involving the local Lipschitzness of the state-value function. Finally, we propose a novel reward-modulation technique inspired by a new interpretation of gradient-penalty regularization in reinforcement learning. Besides being extremely easy to implement and bringing little to no overhead, our method provides improvements in several continuous control environments of the MuJoCo suite.
Joint work with Pablo Strasser and Alexandros Kalousis.
Prof. Giacomo De Giorgi, Predict Mortality from Credit Reports
Data on hundreds of variables related to individual consumer finance behavior (such as credit card and loan activity) is routinely collected in many countries and plays an important role in lending decisions. We postulate that the detailed nature of this data may be used to predict outcomes in unrelated domains such as individual health. We build a series of machine learning models to demonstrate that credit report data can be used to predict individual mortality. Variable groups related to credit cards and various loans, mostly unsecured loans, are shown to carry significant predictive power. Lags of these variables are also significant, indicating that dynamics matter as well. Improved mortality predictions based on consumer finance data can have important economic implications in insurance markets but may also raise privacy concerns.
Mr. Edoardo Vignotto, Towards dynamical adjustment of the full temperature distribution
Internal variability due to atmospheric circulation can dominate the thermodynamical signal present in the climate system at small spatial or short temporal scales, thus fundamentally limiting the detectability of forced climate signals. Dynamical adjustment techniques aim to enhance the signal-to-noise ratio of trends in climate variables such as temperature by removing the influence of atmospheric circulation variability. Forced thermodynamical signals unrelated to circulation variability are then thought to remain in the residuals, allowing a more accurate quantification of changes even at the regional or decadal scale. The majority of these methods focus on the averages of climate variables, thus discounting important distributional features. Here we propose a machine learning dynamical adjustment method for the full temperature distribution that recognizes the stochastic nature of the relationship between the dynamical and thermodynamical components. Furthermore, we illustrate how this method enables evaluating how specific events would have unfolded in a different, counterfactual climate from a few decades ago, thereby characterizing the emergent effect of climatic changes over decadal time scales. We apply our method to observational data over Europe covering the last 70 years.
This is joint work with S. Sippel, F. Lehner, and E. M. Fischer.
Dr. Kostantinos Sechidis, From feature selection to predictive biomarker discovery
One of the key challenges of personalised medicine is to identify which patients will respond positively to a given treatment. The area of subgroup identification focuses on this challenge, i.e. identifying groups of patients that experienced an enhanced treatment effect even if the study failed to show an effect in the overall population. A crucial first step towards subgroup identification is to identify the variables (e.g. biomarkers) that modify the treatment effect, known as predictive biomarkers. In this talk we will connect the problem of predictive biomarker discovery with the problem of supervised feature selection, which occurs when we observe a response variable together with a large number of features, and we would like to know which variables are truly associated with the response. Furthermore, we will review a recent method for performing feature selection while controlling the false discovery rate (FDR), i.e. the expected fraction of variables falsely selected among all discoveries. Finally, we will provide some insights on how to perform controlled predictive biomarker discovery.
Mr. Cesare Miglioli, SWAG: A Wrapper Method for Sparse Learning
Predictive power has always been the main research focus of learning algorithms. While the general approach for these algorithms is to consider all possible attributes in a dataset to best predict the response of interest, an important branch of research is focused on sparse learning. Indeed, in many practical settings we believe that only an extremely small combination of different attributes affects the response. However, even sparse-learning methods can still preserve a high number of attributes in high-dimensional settings and possibly deliver inconsistent prediction performance. These methods can also be hard to interpret for researchers and practitioners, a problem which is even more relevant for the "black-box"-type mechanisms of many learning approaches. Finally, there is often a problem of replicability, since not all data-collection procedures measure (or observe) the same attributes and therefore cannot make use of proposed learners for testing purposes. To address all these issues, we propose to study a procedure that combines screening and wrapper methods and aims to find a library of extremely low-dimensional attribute combinations (with consequently low data collection and storage costs) in order to (i) match or improve the predictive performance of any particular learning method which uses all attributes as an input (including sparse learners); (ii) provide a low-dimensional network of attributes easily interpretable by researchers and practitioners; and (iii) increase the potential replicability of results due to a diversity of attribute combinations defining strong learners with equivalent predictive power. We call this algorithm "Sparse Wrapper AlGorithm" (SWAG).
Joint work with R. Molinari, G. Bakalli, S. Guerrier, S. Orso and O. Scaillet.
Mrs. Amina Mollaysa, Goal-directed Generation of Discrete Structures with Conditional Generative Models
Despite recent advances, goal-directed generation of structured discrete data remains challenging. For problems such as program synthesis (generating source code) and materials design (generating molecules), finding examples which satisfy desired constraints or exhibit desired properties is difficult. In practice, expensive heuristic search or reinforcement learning algorithms are often employed.
We investigate the use of conditional generative models which directly attack this inverse problem, by modeling the distribution of discrete structures given properties of interest. Unfortunately, maximum likelihood training of such models often fails, with the samples from the generative model inadequately respecting the input properties. To address this, we introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward. We avoid the high-variance score-function estimators that would otherwise be required by sampling from an approximation to the normalized rewards, allowing simple Monte Carlo estimation of model gradients. We test our methodology on two tasks: generating molecules with user-defined properties and identifying short Python expressions which evaluate to a given target value. In both cases, we find improvements over maximum likelihood estimation and other baselines.
Joint work with Brooks Paige and Alexandros Kalousis.
Mr. Nicola Gnecco, Causal discovery in heavy-tailed models
Causal questions are omnipresent in many scientific problems. While much progress has been made in the analysis of causal relationships between random variables, these methods are not well suited if the causal mechanisms manifest themselves only in extremes. This work aims to connect the two fields of causal inference and extreme value theory. We define the causal tail coefficient that captures asymmetries in the extremal dependence of two random variables. In the population case, the causal tail coefficient is shown to reveal the causal structure if the distribution follows a linear structural causal model. This holds even in the presence of latent common causes that have the same tail index as the observed variables. Based on a consistent estimator of the causal tail coefficient, we propose a computationally highly efficient algorithm that infers causal structure from finitely many data. We prove that our method consistently recovers the causal order and compare it to other well-established and non-extremal approaches in causal discovery on synthetic and real data. The code is available as an open-access R package on GitHub.
This is joint work with N. Meinshausen, J. Peters, and S. Engelke.
Prof. Dimitri Van De Ville, Community-Aware Graph Signal Processing
The emerging field of graph signal processing (GSP) allows classical signal processing operations (e.g., filtering) to be transposed to signals on graphs. The GSP framework is generally built upon the graph Laplacian, which plays a crucial role in studying graph properties and measuring graph signal smoothness. Here instead, we propose the graph modularity matrix as the centerpiece of GSP, in order to incorporate knowledge about graph community structure when processing signals on the graph, but without the need for community detection. We study this approach in several generic settings such as filtering, optimal sampling and reconstruction, surrogate data generation, and denoising. Feasibility is illustrated by a small-scale example and a transportation network dataset, as well as one application in human neuroimaging where community-aware GSP reveals relationships between behavior and brain features that are not shown by Laplacian-based GSP. This work demonstrates how concepts from network science can lead to new meaningful operations on graph signals.
Joint work with Miljan Petrovic, Raphaël Liégeois, Thomas Bolto
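For readers unfamiliar with the graph modularity matrix mentioned in this abstract, the sketch below builds it from its standard definition for an undirected, unweighted graph, B[i][j] = A[i][j] - k_i * k_j / (2m), where A is the adjacency matrix, k_i the degree of node i and m the number of edges. It only constructs the matrix and evaluates the modularity of a given partition; it does not reproduce the filtering, sampling or denoising operations of the talk. The function names and the toy graph (two triangles joined by one edge) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def modularity_matrix(n: int, edges: Sequence[Edge]) -> List[List[float]]:
    """Return B with B[i][j] = A[i][j] - k_i * k_j / (2m).

    Assumes an undirected, unweighted graph on nodes 0..n-1 given as
    distinct edges without self-loops.
    """
    adj = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = 1.0
        adj[j][i] = 1.0
    degrees = [sum(row) for row in adj]
    two_m = sum(degrees)  # each edge contributes to two degrees
    return [
        [adj[i][j] - degrees[i] * degrees[j] / two_m for j in range(n)]
        for i in range(n)
    ]


def modularity(n: int, edges: Sequence[Edge], labels: Sequence[int]) -> float:
    """Modularity Q of a partition: sum of B over same-community pairs, over 2m."""
    b = modularity_matrix(n, edges)
    two_m = 2 * len(edges)
    within = sum(
        b[i][j] for i in range(n) for j in range(n) if labels[i] == labels[j]
    )
    return within / two_m


# Toy graph (hypothetical): triangles {0, 1, 2} and {3, 4, 5} linked by edge 2-3.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
B = modularity_matrix(6, toy_edges)
q = modularity(6, toy_edges, labels=[0, 0, 0, 1, 1, 1])
print(round(q, 3))  # 0.357, i.e. 5/14
```

Every row of B sums to zero by construction, and the partition into the two triangles scores a clearly positive modularity, which is the kind of community information the matrix carries without running a community detection step.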