Turboloading dark matter searches with differentiable programming

# Turboloading dark matter searches with differentiable programming

.smaller.center[Christoph Weniger (GRAPPA/IoP/UvA)]

.center[
.width-25[![](figures/gallery_obs.png)]
.width-25[![](figures/gallery_fit.png)]
.width-25[![](figures/gallery_true_src.png)]
]

GSSI Colloquium]

---

class: black-slide
background-image: url(figures/pietro-jeng-n6B49lTx7NM-unsplash.jpg)
background-size: cover

# AI and Dark Universe at GRAPPA

.block.grid[
.kol-1-2[
.center.bold[Data]
- Gravitational waves
- Stellar streams
- Neutron star radio emission
- **Strong lensing images**
- Gamma-ray data (Fermi, CTA)
- Cosmology
]
.kol-1-2[
.center.bold[Methods]
- **Variational inference**
- Differentiable programming
- Probabilistic programming
- GPU acceleration
- **Simulation-based inference**
]
]

.center[
.width-20[![](figures/erc_logo_0.jpg)]
.width-40[![](figures/escience_center.jpg)]
.width-30[![](figures/ai4science.png)]
]

---

# The energy budget of the Universe

## What is Dark Matter made of?

---

# Dark matter candidates

---

class: black-slide, center, bottom
background-image: url(figures/illustris_box_z0124_4panel.png)
background-size: contain

.center.block[

## Evolution of structure in the Universe (bottom to top)

Dark matter overdensities collapse into increasily large halos.

Large enough halos are sites of star and galaxy formation.

Details depend on the particle nature of dark matter!

]

---

# Dark matter mass vs substructure

Hot, warm and cold dark matter (from left to right)

at early times and late times (top to bottom)

---

# Dark matter mass vs substructure

Amount of substructure depends on the dark matter particle mass.

.red.center[Absence of substructure implies that the Cold Dark Matter paradigm is wrong.]

.footnote[Image credit: Chianese, Marco. 2019. “Differentiable Probabilistic Programming for Strong Gravitational Lensing.” arXiv [astro-ph.CO]. arXiv. https://doi.org/10.22323/1.358.0515]

---

class: black-slide, center
background-image: url(figures/A_Horseshoe_Einstein_Ring_from_Hubble.jpeg)
background-size: cover

# Strong gravitational lensing

## A "horseshoe Einstein Ring from Hubble"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
The gravity of a luminous red galaxy has gravitationally distorted the light from a much more distant blue galaxy.

---

# Strong lensing

.grid[
.kol-2-3[
.center[.width-100[![](figures/lensing_overview.png)]]
]
.kol-1-4[
Lensed quasar
.width-100[![](figures/lensing_multi_quasars.png)]

Lensed galaxy
.width-100[![](figures/lensing_galaxy.png)]
]
]

---
exclude: true

# Weak and strong lensing

---

# The effect of dark matter substructure

---

# The goal

.center.large[Constrain the mass of dark matter particles by constraining the amount of dark matter substructure in the images.]

---

# First approaches using neural networks

Example: Brehmer+ 2019.  The authors train classification networks to estimate posteriors for subhalo parameters.

.center.width-80[![](figures/Brehmer_posteriors.png)]

Very promising in context of this toy scenario

.red.center[How can this method be successfully applied to real complex images?]

.footnote[From Brehmer, Johann, Siddharth Mishra-Sharma, Joeri Hermans, Gilles Louppe, and Kyle Cranmer. 2019. “Mining for Dark Matter Substructure: Inferring Subhalo Population Properties from Strong Lenses with Machine Learning.” arXiv [astro-ph.CO]. arXiv. http://arxiv.org/abs/1909.02005]

---

# Problem: DM signal << Image variance

.center[Effect of subhalos at different positions]
]
.kol-1-2[
.bold.center[Variance between images is .underline[very large]]

.center.width-100[![](figures/213372main_67_hubble_lg.jpg)]

---

# Variance between images is .underline[very large]

.center.width-80[![](figures/213372main_67_hubble_lg.jpg)]

---

background-image: url(figures/A_Horseshoe_Einstein_Ring_from_Hubble.jpeg)
background-size: contain

.block[
# How can modern machine learning techniques help to perform .underline[precision] analyses of strong lensing images?
]

---

# Bayesian inference

---

# Observed data $x$ vs latent space $z$

.grid[
.kol-1-3[.bold[Gravitational lensing] .width-100[![](figures/A_Horseshoe_Einstein_Ring_from_Hubble.jpeg)]
- $x$: image
- $z$: DM distribution, source light distribution, ...
]
.kol-1-3[.bold[Gravitational waves] .width-100[![](https://cdn.mos.cms.futurecdn.net/fNo6JvmhpmL3qL3nnMSVzK-970-80.jpeg)]
- $x$: waveform
- $z$: BH masses, BH spins, distance, sky position, ...
]
.kol-1-3[.bold[Particle colliders] .width-100[![](https://assets.newatlas.com/dims4/default/1299ab9/2147483647/strip/true/crop/800x533+0+1/resize/1200x800!/format/webp/quality/90/?url=http%3A%2F%2Fnewatlas-brightspot.s3.amazonaws.com%2Farchive%2Fquantizer-1.jpeg)]
- $x$: measured four-vectors
- $z$: MSSM parameters, all details of particular particle collision 
]
]

---

# Forward model vs generative model

.large.block[

- $P(x|z)$: **Forward model** or **Likelihood**

(Physics + detector simulators)

- Prior knowledge provides parameter **priors** $P(z)$
- This defines **generative model** $P(x, z) \equiv P(x|z)P(z)$
]

---

# Solving the inverse problem

.center.red[Bayes' theorem provides a rule for updating believes in light of new data]

.grid[
.kol-1-2[
## Likelihood/simulator $P(x|z)$
How probable is data $x$ assuming that parameters $z$ are true?
]
.kol-1-2[
## Prior $P(z)$
How probable are the parameters $z$?
]
]

$$
P(z|x) = \frac{P(x|z) P(z)}{P(x)}
$$

.grid[
.kol-1-2[
## Posterior $P(z|x)$
How probably are our parameters $z$ given the observed data $x$?
]
.kol-1-2[
## Marginal likelihood $P(x)$
How probable is the new data $x$ under all possible hypothesis $z$?

$$
P(x) \equiv \int dz\, P(x|z) P(z)
$$
]
]

---

# The inverse problem

## Bayes' theorem

$$
P(z|x) = \frac{P(x|z) P(z)}{P(x)}
$$

$P(x|z)$
]
.kol-1-5.center[
Prior

$P(z)$
]
.kol-1-5.centerj[
Posterior

$P(z|x)$
]
.kol-2-5.center[
Evidence

$P(x) \equiv \int dz\, P(x|z) P(z)$
]
]

---

This gives the **joined posterior** for the full latent space $z\in \mathbb{R}^n$.  Almost always, we want however the **marginal posterior**, say, for $\theta\in \mathbb{R}$, where $z\equiv(\theta, \eta)$ and $\eta \in \mathbb{R}^{n-1}$ are nuisance parameters.

$$
P(\theta|x) = \int d\eta\, P(\theta, \eta|x) 
$$

Alas, this integral is usually **intractable**.

---

# Markov Chain Monte Carlo

.center.red[The method-of-choice for inverse problem solving MCMC]

.center.width-55[![](figures/3dRosenbrock.png)]

.center[We are usually interested in 1-dim or 2-dim projections of the high-dim chain (the parameters of interest)]

[Chi-Feng animations](https://chi-feng.github.io/mcmc-demo/app.html?algorithm=RandomWalkMH&target=banana)

.footnote[Image credit: https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#/media/File:3dRosenbrock.png]

---

# Marginal posteriors

- Strong lensing images have 10000s of parameters

$$
\vec z = (\vec z\_\text{main lens}, \vec z\_\text{source}, \vec z\_\text{substructure})
$$

- One is seldomly interested in the full $P(z|x)$
- Instead, we are usually interested in marginalized version, e.g.
  $$
  P(m\_\text{dm}|x) = \int dz\_l\, dz\_s\, dz\_h\, p(x|z\_l, z\_s, z\_h) p(z\_l) p(z\_s) p(z\_h|m\_\text{dm}) p(m\_\text{dm})
  $$
- Problem, what to do with this massive integral?

.center[The "curse of dimensionality"]
]
.kol-1-2[
.center.width-40[![](figures/Bimodal-bivariate-small.png)]

---

# Markov Chain Monte Carlo

.grid[
.kol-1-2[
## Methods
- Metropolis Hastings
- Gibbs sampling
- Nested sampling
- Hamiltonian Monte Carlo
- others
]
.kol-1-2[
## Visualization
Typical "corner plot" that shows 1-dim and 2-dim posteriors
.right[.width-90[![](https://www.researchgate.net/profile/Carmelita_Carbone/publication/321719035/figure/fig1/AS:614186346352656@1523444844641/Joint-posterior-distribution-obtained-using-Planck-temperature-and-polarization-data-and.png)]]
]
]

---

# Challenges for MCMC

## 1. MCMC does not scale well to extremely large number of parameters

.center.width-60[![](figures/curse_of_dim.png)]

.center[In high dimensions, $z\in \mathbb{R}^n$ with $n \gg 1$, the volume in the center partition becomes negligible compared to the neighboring volume.]

## 2. MCMC does struggle with multi-model distributions

.center.width-30[![](figures/Bimodal-bivariate-small.png)]

---

# Classical inference methods

.bold.center[The two faces of $P(x|z)$]

$P(x|z)$ gives the probability of observing $x$ given parameters $z$.

$$
P_{\mathcal{N}}(x=0.3|\mu = 0., \sigma = 1) \simeq 0.38
$$

.bold[Typical algorithms]
  - chi2 fits
  - Markov Chain Monte Carlos (Metropolis Hastings, Hamiltonian MC, etc)

]
.kol-1-2[
.red.bold[Sampler]

Sampling synthetic observations $x\sim P(x|z)$ given parameters $z$.

$$
-0.24, 0.91 \sim P\_{\mathcal{N}}(x|\mu = 0., \sigma = 1)
$$

- Approximate Bayesian computation

]
]

.grid[
.kol-1-2[
.bold[Properties]
  - Asymptotically optimal
  - Marginalization = Integration : costly
]
.kol-1-2[
.bold[Properties]
  - Requires choice of summary statistics
  - Marginalization = Ignoring : cheap
]
]

---

background-image: url(figures/pietro-jeng-n6B49lTx7NM-unsplash.jpg)
background-size: cover

# Modern machine learning power-ups

.red.block.center.large[
| Category | Traditional method | Modern ML |
|------------------|------|------|
| Likelihood-based | Markov chain monte carlo, Profile likelihood methods, ... | Variational Inference |
| Likelihood-free | Approximate Bayesian Computation | Simulator based neural inference |
]

.block.center[Under the hood, modern methods are based on the ability of solving massive high-dimensional optimization problems]

---

# Variational inference

## Reconstruction of the main lens and source parameters of a gravitational lensing system

---

# Machine learning super powers

## Likelihood-based inference --> Variational inference

Instead of sampling from the high-dimensional joint posterior, we fit it using
auto-differentiation and gradient descent.

## Sampling based inference, ABC --> Neural simulator based inference

Instead of hand-crafting test statistics and thresholds, we automatically optimize them by using auto-differentiation and gradient descent.

---

# Variational Bayesian (VI) methods

.center[Let's find some fitting function $Q(z)$, aka **variational posterior**, such that $Q(z) \approx P(z|x)$]

.center[![](https://4.bp.blogspot.com/-OCU72-Cp5lg/V6fxbBAV4oI/AAAAAAAAFE4/BMcR5OYwZqwARnqFnm3I9I_S46O-IH-uQCLcB/s1600/Untitled%2Bpresentation%2B%25282%2529.png)]

.footnote[See Carleo, Giuseppe et al., 2019. “Machine Learning and the Physical Sciences.” http://arxiv.org/abs/1903.10563; Excellent overview: Zhang, Cheng et al. 2017. “Advances in Variational Inference.” http://arxiv.org/abs/1711.05597; also Cranmer, Kyle, Johann Brehmer, and Gilles Louppe. 2019. “The Frontier of Simulation-Based Inference.” http://arxiv.org/abs/1911.01429; Cranmer, Kyle, Juan Pavez, and Gilles Louppe. 2015. “Approximating Likelihood Ratios with Calibrated Discriminative Classifiers.” http://arxiv.org/abs/1506.02169]

---

# The distance between distributions

## Kullback-Leibler divergence

$$
D_{\rm KL}(P||Q)  = \int p(x) \ln \left(\frac{p(x)}{q(x)}\right)dx
$$

.center.width-60[![](figures/KL_example.png)]

## Properties
- $D_{\rm KL}(P||Q) = 0$ $\Leftrightarrow$ $p(x) = q(x)$
- $D\_{\rm KL}(P||Q) \geq 0$

---

# Failure modes of the KL divergence

- "mode seeking"
- good when aim is **good fit** to data

## Forward KL divergence

- "mass covering"
- good when aim is **conservative parameter constraints**
]
.kol-3-5[
.width-100[![](http://timvieira.github.io/blog/images/KL-inclusive-exclusive.png)]
]
]

Nomenclature adopted from Ambrogioni, Luca et al. 2018. “Forward Amortized Inference for Likelihood-Free Variational Marginalization.” arXiv [stat.ML]. arXiv. http://arxiv.org/abs/1805.11542]

---

# Variational inference in practice

## Evidence lower bound

$$
\text{ELBO} \equiv p(x\_0) - D\_{KL}(q\_\phi||p)
$$

ELBO maximization w.r.t. variational parameters $\phi$ yields variational posterior
$$q\_\phi(z|x_0)\approx p(z|x_0)$$
]

The ELBO can be written in a tractable form as

$$
\text{ELBO} = \underbrace{\mathbb{E}\_{z \sim q\_\phi(z|x\_0)} \left[\ln p(x\_0|z)p(z)\right]}\_{\rm "Negative\; energy"} - \underbrace{\mathbb{E}\_{z \sim q\_\phi(z|x\_0)} \left[\ln q\_\phi(z|x\_0)\right]}\_{\rm Entropy}
$$

- First term (negative energy): Rewards a good fit to data
- Second term (entropy): Penalizes overfitting / narrow variational posteriors

.footnote[See e.g. Hoffman et al. 2016, “ELBO Surgery: Yet Another Way to Carve up the Variational Evidence Lower Bound.” http://approximateinference.org/accepted/HoffmanJohnson2016.pdf]

---

# Differentiable programming

## Variational posterior $q_\phi(z|x_0)$ depends on large number of fitting parameters $\phi$
- If $q$ is MVN, we have $n^2+n$ fitting parameters for $n$-dim latent space $z$
- This can be mitigated by making simplifying assumptions, e.g. neglecting parameter correlations
.center.width-35[![](figures/renyi.png)]

---

# A balance of three forces

.center.width-95[![](figures/ELBO_LucaA.jpg)]

---

# .underline[Differentiable programming]

We need to be able to quickly calculate derivatives of the forward model, $\nabla_z p(x|z)$
]

.center[
## This can be efficiently achieved using auto-grad programming frameworks (pytorch, JAX) from ML
]

.center.width-80[![](figures/AutomaticDifferentiationNutshell.png)]

---

# Example: Water height simulator

.center.bold[Task: Solve the initial value problem for water flowing into the Taichi symbol]

.center.width-40[![](figures/DiffTaichi_watersim.gif)]

.footnote[See: https://syncedreview.com/2020/01/13/introducing-difftaichi-a-differentiable-programming-language-tailored-for-physical-simulation/]

---

# Approximations

.center.width-35[![](figures/renyi.png)]

Posterior factorizes
$$
q(z|x\_0) = \prod\_{i=1}^N q\_i(z\_i|x\_0)
$$

---

# Back to strong lensing

## The goal using VI: reconstruction of the source image and the main lens properties

.center[
.width-30[![](figures/NeurIPS_Fig1a.png)]
.width-30[![](figures/NeurIPS_Fig1c.png)]
.width-30[![](figures/NeurIPS_Fig1d.png)]
]

---

## In order to enable the use of VI, we implemented our pipeline in `pytorch`

- GPU-based simulation of high-resolution images in a split second
- Likelihood gradients w.r.t. hundred thousand of parameters with O(1) overhead

---

# The strong lensing model

- Pixelated source galaxy image $\rightarrow$ 10000 parameters
- Analytic mass distribution in the lens $\rightarrow$ 10 parameters

.center.bold[We assume that lens and source posteriors are *Gaussian and neglect parameter correlations*]

---

# Modeling the source as Gaussian process

- Hyper-parameter optimization through maximization of marginal likelihood
- Kernel is simple in source plane
- But: Lensing causes complicated kernal structure in image plane

$\Rightarrow$ **existing approaches/libraries were insufficient for our use-case**

---

# Factorized Gaussian process

## Covariance of image pixels from indicator functions and GP kernel
$$
K_{ij}  = k(\mathbf{\vec{p}}_i - \mathbf{\vec{p}}_j)
= \int \text{d}\vec{x}_1 \text{d}\vec{x}_2
\ f_i(\mathbf{\vec{p}}_i - \vec{x}_1)
\ k_0(\vec{x}_1 - \vec{x}_2)
\ f_j(\mathbf{\vec{p}}_j - \vec{x}_2)
$$

.grid[
.kol-2-5[
Approximate factorization
$$
K \approx T T^T
$$
allows to write
$$
\mu\_i = \sum\_j T_{ij} y_j
$$
$$
y_j \sim \mathcal{N}(0, 1)
$$
]
.kol-3-5[
.width-80[![](figures/GP_source.png)]
]
]

---

# Factorization of covariance matrix

The intergral
$$
K_{ij}  = k(\mathbf{\vec{p}}_i - \mathbf{\vec{p}}_j)
= \int \text{d}\vec{x}_1 \text{d}\vec{x}_2
\ f_i(\mathbf{\vec{p}}_i - \vec{x}_1)
\ k_0(\vec{x}_1 - \vec{x}_2)
\ f_j(\mathbf{\vec{p}}_j - \vec{x}_2)
$$
becomes easy to evaluate if we approximate pixel indicator functions with multivariate Gaussians.

- .red[Integral can be expressed in terms of multivariate Gaussian].
- By replacing $k_0$ with $\sqrt{k_0}\sqrt{k_0}$, we can factorize $K$.

---

# Exact vs approx. correlation coefficients

.center.width-60[![](figures/windowing_c_idist.png)]

---

# Reverse KL divergence

Goal: Find $q_\phi(z|x_0)$ for some observation $x_0$ by minimize the KL divergence at $x=x_0$

One can show that minimizing KL divergence is equivalent to maximizing **Evidence Lower Bound (ELBO)**

$$
\text{ELBO} \equiv \underbrace{\mathbb{E}\_{z \sim q\_\phi(z|x\_0)} \left[\ln p\_\theta(x\_0,z)\right]}\_{\rm "Negative\; energy"} - \underbrace{\mathbb{E}\_{z \sim q\_\phi(z|x\_0)} \left[\ln q\_\phi(z|x\_0)\right]}\_{\rm Entropy}
$$

ELBO maximization w.r.t. model parameters $\theta$ and variational parameters $\phi$ leads to
- Variational posterior $q\_\phi(z|x_0)$ for data $x_0$
- Large model evidence $p\_\theta(x_0)$

.footnote[See e.g. Hoffman et al. 2016, “ELBO Surgery: Yet Another Way to Carve up the Variational Evidence Lower Bound.” http://approximateinference.org/accepted/HoffmanJohnson2016.pdf]

---

# Exact vs approx. GP

.grid[
.kol-1-2[
## Properties of our approximate GP
- No log-determinant evaluation (happens on-the-fly as part of ELBO maximization)
- Can model complex kernels
- Reasonable agreement with exact Gaussian process regression
]
.kol-1-2[
.center.width-100[![](figures/gp-nonoise.png)]
]
]

---

# Result: Image reconstruction with VI

.center.width-100[![](figures/lens_toy_img-rec.png)]

- We quickly can find lensing parameters and source parameters that reproduce the image
- Hyper-parameter fits optimize variations in the source

---

# Result: Source reconstruction with VI

.center[
.width-35[![](figures/lens_toy_source_true.png)] .width-55[![](figures/lens_toy_gp_fit_err.png)]
]

---

# Source layers differ in smoothness

## Mean values
.center.width-80[![](figures/lens_toy_gp_layers_mean.png)]

## Uncertainties
.center.width-80[![](figures/lens_toy_gp_layers_err.png)]

---

# More examples with complex galaxies

.center.bold[Works great!]

---

class: middle, black-slide
background-image: url(figures/zoltan-alps.jpg)
background-size: cover

## Problem: Subhalos configurations are permutation symmetric (labeling invariant)

- 40 subhalos can be relabled in
  $$
  40! = 815915283247897734345611269596115894272000000000
  $$
  different ways.
- This leads to $\sim10^{48}$ modes in the joint posterior, which is challenging to handle with likelihood-based techniques that explore the joint posterior.
- In reality, not even the number of subhalos is known, and has to be marginalized in the fit as well.

.red.center[
Although VI works great for the source and the main lens, 
it is very hard to use for large numbers of dark matter subhalos.]
]

.footnote[However, proress is possible with trans-dimensional sampling, see .e.g. Brewer+ 1508.00662 and Daylan+ 1706.06111, using trans-dimensional sampling.  Photo by Zoltan Tasi on Unsplash]

---

# Full correlated posteriors

---

# Simulator-based inference

## Analysing subhalo population

---

# Towards targeted inference networks

.center.red[Training data can now be generated by sampling from the posterior predictive distribution]

.center.width-100[![](figures/ppd_samples.gif)]

---

# Our strategy: Targeted training

.grid[
.kol-1-3[
.center[1) Pre-fit of image with variational inference]
.center.width-80[![](figures/NeurIPS_Fig2a.png)]
]
.kol-1-3[
.center[2) Generate targeted NN training data]
.center.width-80[![](figures/ppd_samples-left.gif)]
]
.kol-1-3[
.center[3) Train neural network to infer substructure]
.center.width-80[![](figures/NeurIPS_Fig2b.png)]
]
]

How to generate training data for targeted training?

]

---

# Our strategy: Targeted training

$$
p(z_h|x) = \int dz_s dz_l p(x| z_h, z_l, z_s) p(z_s) p(z_l) p(z_h)
$$
.center[model parameters: $z_l$ (lens, 10-dim), $z_s$ (source, 10000-dim), $z_h$ (subhalos, 3-10000 dim)]

## Strategy

1. Use Variational inference to obtain approximate posteriors for lens and source parameters (neglect small subhalo effects).
  $$
  p(z_l, z_s|x) \approx \int dz_s\, dz_l\, p(x|z_l, z_s) p(z_s) p(z_l)
  $$
  Use this to constrain the prior, $p(z_l)p(z_s) \to p_c(z_l)p_c(z_s)$ to regions that roughly match the image.

2. Use constrained posteriors to generate training data for a targeted inference network.
  $$
  x, z_h \sim p(x|z_l, z_s, z_h) p_c(z_l) p_c(z_s) p(z_h)
  $$
  Marginalizatoin is done by simply ignoring eveything that is not of interest.

---

# Parameter regression with ConvNets

## A strong lensing image anlaysis example

.center[
.width-40[![](figures/Hezaveh2017_Fig2.png)]
.width-40[![](figures/Hezaveh2017_Fig1a.png)]
]

.center[Convolutional neural networks can be trained to super-quickly determine, e.g., the Einstein radius given a strongly lensed galaxy image.]

---

# Neural networks are fitting functions

## Neural network

A neural network is a complex (often piece-wise linear) fitting function
$$
x = f\_\phi(z)
$$
which maps $n$-dim inputs $ z \in \mathbb{R}^{n}$ on $m$-dim outputs $ x \in \mathbb{R}^{m} $.

The fitting function has **lots** of free parameters, $\phi \in \mathbb{R}^N$.

## Training of neural network

Given $N$ training examples $x\_i, z\_i$, with $i = 1, ..., N$,
the goal is to find network parameters $\phi$ such that

$$
x\_i \approx f\_\phi(z\_i) \quad \text{for all} \quad i = 1, ..., N
$$

---

# Neural networks

.center[
Neural networks are flexible fitting functions, $x = f\_\phi(z)$,
which maps $n$-dim inputs $ z \in \mathbb{R}^{n}$ on $m$-dim outputs $ x \in \mathbb{R}^{m} $.
]

.center.width-65[![](figures/neural_net_simple.png)]

.footnote[Image credit: https://medium.com/fintechexplained/neural-networks-a-solid-practical-guide-9f343594b02a]

---

# Training via stochastic gradient descent

$$
\ell = \sum\_{i=1}^N (f\_\phi(z\_i) - x\_i)^2
$$

(here mean square error)
]
]
.kol-1-2[
.center[
**Update rule**
$$
\phi \to \phi - \epsilon \nabla\_\phi \ell
$$
]
]
]

.center.width-70[![](figures/descent.png)]

.footnote[Image credit: https://suniljangirblog.wordpress.com/2018/12/13/variants-of-gradient-descent/]

---

.large.center[How can we bring this successfully to the analysis of small-scale DM structure?]

---

# The traditional analysis

.center.width-80[![](figures/traditional_lens.svg)]

## Challenges

- Hyper-parameter optimization requires performing above analysis on a grid, to
optimize Bayesian evidence
- No source uncertainties can be reconstructed
- Very hard to incorporate additional components in the lens model (multiple subhalos), since posteriors become highly multi-modal

---

# Propositions

.large.center[Neural networks are awesome tools to answers complex questions roughly right.]

--
exclude: true

.large.center[Analysing subtle variations in hugely varying images is difficult.]

--
exclude: true
count: false

.large.center[It seems challenging to train and trust a single massive catch-all network to solve high-precision problems.]

---

# Propositions

.large.center[Neural networks are awesome tools to answers complex questions roughly right.]
.grid[
.kol-1-6[
.center[Cat]
.width-100[![](figures/cat.jpeg)]
]
.kol-1-6[
.center[Donkey]
.width-70[![](figures/donkey.jpeg)]
]
.kol-1-6[
.center[Cat]
.width-100[![](figures/cat2.jpeg)]
]
.kol-1-6[
.center[Cat]
.width-100[![](figures/cat3.jpeg)]
]
.kol-1-6[
.center[Cat]
.width-100[![](figures/donkey2.jpeg)]
]
.kol-1-6[
.center[Donkey]
.width-80[![](figures/donkey3.jpeg)]
]
]

---

# Propositions

.large.center[Analysing subtle variations in hugely varying images is difficult.]

.smaller.center[Example: What is the length of the ears relative to the size of the eyes?]

.grid[
.kol-1-6[
.width-100[![](figures/cat.jpeg)]
]
.kol-1-6[
.width-70[![](figures/donkey.jpeg)]
]
.kol-1-6[
.width-100[![](figures/cat2.jpeg)]
]
.kol-1-6[
.width-100[![](figures/cat3.jpeg)]
]
.kol-1-6[
.width-100[![](figures/donkey2.jpeg)]
]
.kol-1-6[
.width-80[![](figures/donkey3.jpeg)]
]
]

---

# Propositions

.large.center[It seems challenging to train and trust a single massive catch-all network to solve high-precision problems.]

---

# Our strategy

## Step 1: Generate image-specific training data with variational methods

- Perform high-dimensional fits
- Component separation
- Hyper-parameter optimization
- Can be used to generate training data

## Step 2: Train inference network to anlayse dark matter structures
- Posterior estimation with targeted neural ratio estimation

---

# Marginal posterior estimation with inference networks

.red.center[Our goal marginal posteriors about properties of substructure in the images, like:]

.smaller.center[How to train inference network for $p(z|\vec x)$?]

---

# Neural likelihood-free inference

.center.red[Our goal is to train an augmented CNN to estimate marginal posteriors for parameters of interest.]

.bold[Starting point]: for any pair of observation $x$ and model parameter $\theta$, the goal is to estimate the probability that this pair belongs one of the following classes:
.grid[
.kol-1-2[
$H_0$: Data $\vec x$ corresponds to model parameters $\vec z$

$H_1$: Data $\vec x$ and model $\vec z$ are unrelated
]
.kol-1-2[
$(\vec x, \vec z) \sim P(\vec x, \vec z) = P(\vec x|\vec z)P(\vec z)$

$(\vec x, \vec z) \sim P(\vec x)P(\vec z)$
]
]

---

# Joint vs marginal samples

1) Examples for $H_0$, jointly sampled from $\vec x, z \sim P(\vec x|z) P(z)$

.grid[
.kol-1-6[
.center[Cat]
.width-100[![](figures/cat.jpeg)]
]
.kol-1-6[
.center[Donkey]
.width-70[![](figures/donkey.jpeg)]
]
.kol-1-6[
.center[Cat]
.width-100[![](figures/cat2.jpeg)]
]
.kol-1-6[
.center[Cat]
.width-100[![](figures/cat3.jpeg)]
]
.kol-1-6[
.center[Donkey]
.width-100[![](figures/donkey2.jpeg)]
]
.kol-1-6[
.center[Donkey]
.width-80[![](figures/donkey3.jpeg)]
]
]

2) Examples for $H_1$, marginally sampled from $\vec x, z \sim P(\vec x) P(z)$

.grid[
.kol-1-6[
.center[Donkey]
.width-100[![](figures/donkey3.jpeg)]
]
.kol-1-6[
.center[Cat]
.width-100[![](figures/cat3.jpeg)]
]
.kol-1-6[
.center[Cat]
.width-100[![](figures/donkey2.jpeg)]
]
.kol-1-6[
.center[Cat]
.width-70[![](figures/donkey.jpeg)]
]
.kol-1-6[
.center[Donkey]
.width-100[![](figures/cat2.jpeg)]
]
.kol-1-6[
.center[Donkey]
.width-80[![](figures/cat.jpeg)]
]
]

---

# Loss function

.bold[Strategy:] We train a neural network $d_\phi(\vec x, z) \in [0, 1]$ as binary classifier to estimate the probability of hypothesis $H_0$ or $H_1$.  The network output can be interpreted, for a given input pair $\vec x$ and $z$, as probability that $H_0$ is true.
- H0 is true: $d_\phi(\vec x, z) \simeq 1$
- H1 is true: $d_\phi(\vec x, z) \simeq 0$

The corresponding loss function is (so-called "binary cross-entroy")

$$
L\left[d(\vec x, z)\right] = -\int dx dz  \left[ p(\vec x, z) \ln\left(d(\vec x, z)\right) + p(\vec x)p(z) \ln\left(1-d(\vec x, z)\right) \right]
$$

--
count: false

Minimizing that function (see next slide) w.r.t. the network parameters $\phi$ yields
$$
d(\vec x, z) \approx \frac{p(\vec x, z)}{p(\vec x, z) + p(\vec x)p(z)}
$$

---

# Properties of optimized network

## Some analytical estimates

$$
\frac{\partial}{\partial\phi}L \left[d_\phi(\vec x, z)\right] = \int dx dz  \left[ p(\vec x, z) \ln\left(d(\vec x, z)\right) + p(\vec x)p(z) \ln\left(1-d(\vec x, z) \right) \right]
$$
$$
= \int dx dz  \left[ \frac{p(\vec x, z)}{d(\vec x, z)} + \frac{p(\vec x)p(z)}{1-d(\vec x, z) } \right] \frac{\partial d(\vec x, z)}{\partial \phi}
$$

Setting the part in square brackets to zero yields that the network is optimized once
$$
d(\vec x, z) \approx \frac{p(\vec x, z)}{p(\vec x, z) + p(\vec x)p(z)}
$$

---

# Likelihood-to-evidence ratio

.red.center[Training binary classification networks yield true Bayesian posterior estimates!]

With a bit of math one can show that
$$
r(\vec x, z) \equiv \frac{1}{d(\vec x, z)}-1 \simeq
\frac{p(\vec x|z)}{p(\vec x)} 
= \frac{p(z|\vec x)}{p(z)} 
$$

Once we have trained the network $d_\phi(\vec x, z)$, we can **estimate the posterior**
$$
p(z|\vec x) \simeq  r(\vec x, z) p(z)
$$

---

# Approach: Likelihood-to-evidence ratio

Assume that the goal is to distriminate parameters $z$ matching an observation $x$ from parameters that do not match an observation. We want to distriminate
- $H_0$: $x, z \sim  p(x, z)$
- $H_1$: $x, z \sim  p(x)p(z)$

We train a binary classifier to discriminate between $(x, z)$ pairs drawn from the joint distribution $p(x, z)$ and the marginal distributions $p(x)p(z)$.  This loss function is given by
$$\ell = \int dx dz \left(
p(x, z)\ln d(x, z)
+p(x)p(z)\ln [1-d(x, z)]
\right)$$

One can then show that the trained neural network $d(x, z) \in [0, 1]$ satisfies

$$r(x,z)\equiv\frac{d(x, z)}{1-d(x, z)} = \frac{p(x|z)}{p(x)}$$

which is the likelihood-to-evidence ratio.  Training only requires simulation/samples, and side-steps any likelihood marginalization.

---

# The inference network

.center.red[We use a very simple network structure (low capacity, fast to train).]

## Three conv layers
.center[$\vec x = \text{conv2d}(\vec x)$]
.center[$\vec x = \text{conv2d}(\vec x)$]

## Concatenating features from CNN, $\vec f$ with model parameters $z$
.center[$\vec f' = \text{concat}(\vec f, z)$]

## Three dense layers
.center[$r(x, z) = \text{MLP}(\vec f')$]

---

We used variational inference fit from above to generate training data for a neural network that can recognize individual subhalos.

---

# Strong lensing image analysis

.center.width-80[![](figures/NeurIPS_Fig2.png)]

The shown posteriors are **effectively marginalized over thousands of source and
lens parameters**.  Those marginal posteriors can be evaluated in seconds once
the network is trained (and training takes maybe 20 min).

---

.block.center.large[It start to work: Similar sensitivity to full likelihood analysis, but much much more flexible.]

---

# Spinoff I: Nested posterior estimation

The idea of targeted inference networks can be generalized.

---

# Spinoff II: Analysis of $\gamma$-ray data

---

# swyft: stop wasting your precious time

.center[B. Miller, A. Cole, G. Louppe, CW: https://arxiv.org/abs/2010.07032,
https://github.com/undark-lab/swyft]

.bold.center[We recently released a Python package for **neural nested posterior estimation**.]

.center.width-60[![](figures/neurips_swyft_fig1.png)]

- directly estimates any marginal posterior (in the above example, we never estimate the full 3-dim posterior but still get 2-dim marginal posteriors right)
- performance doesn't depend on the number of nuiscance parameters
- enables the re-use of past simulations

---

# Conclusions

- The analysis of .red[gravitationally strongly lensed galaxy images] enables to study the substructure of dark matter.
- Extracting information from these images requires .red[sub-percent accuracy] of the analysis.
- We developed a .underline.red[two-step analysis pipeline].
- Step 1) Source and main lens reconstruction with .red[variational inference], including uncertainties. This enables the generation of .red[targetd training data].
- Step 2) Training of an .red[targeted inference network] to perform substructure analysis.
- The approach enables to study the combined effect of thousands of subhalos, which is .red[intractable with traditional techniques].
- Lots of .red[spin-off work] (dark matter searches with gamma-ray data, gravitational wave parameter inference, ...)

--
count: false

.red.center[Check out swyft!]

--
count: false

.large.center[Thank you!]