Welcome everyone to the machine learning seminar today. Today we're very fortunate to have Professor Arthur Gretton give the seminar. Professor Gretton is a well-renowned expert in machine learning who has worked on many topics. I have known Professor Arthur for many years, from the MMD statistics work, and was fortunate to have some discussions with him before. So that's the informal part of the introduction. For the formal part: Professor Gretton is with the Gatsby Computational Neuroscience Unit and is director of the Centre for Computational Statistics and Machine Learning at UCL, University College London. He received a degree in physics and systems engineering from the Australian National University, and a PhD with Microsoft Research and the Signal Processing and Communications Laboratory at the University of Cambridge. He previously worked at the MPI for Biological Cybernetics and at the Machine Learning Department of CMU. His recent research interests in machine learning include the design and training of generative models, both GAN-type and explicit (such as exponential family and energy-based models), causal modeling, and nonparametric hypothesis testing. He has been an associate editor for many journals, including the IEEE Transactions on Pattern Analysis and Machine Intelligence from 2009 to 2013, an action editor for the Journal of Machine Learning Research since 2013, a senior area chair for NeurIPS in 2018 and 2021, and a member of the Royal Statistical Society Research Section Committee since January 2020. Professor Gretton was a program chair for AISTATS in 2016, tutorials chair for ICML 2018, workshop chair for ICML 2019, program chair for the DALI workshop in 2019, and an organizer of the Machine Learning Summer School 2019 in London. Okay, so without much further ado, let's welcome Professor Gretton to give the talk today. Thank you.

Victor, thank you for the kind invitation to speak, and hopefully I'll be able to come and visit in person before too long. Just one request: I like to get questions while I give the talk, so please don't wait until the end. Feel free to interrupt me if anything is not clear and I'll do my best to answer.

The talk I'm presenting today is on generalized energy-based models. It's based on a paper from earlier this year with Michael Arbel and Liang Zhou, who are shown here in the picture. The theme of this talk is training generative models. What I mean by this is: let's say I have a collection of samples from some unknown distribution, samples x_i from an unknown P. In this case, P is a distribution over quite complex, high-dimensional objects; in this example on the slide, P is a distribution over images of bedrooms. And I have a goal: to teach a machine learning algorithm to generate samples from a distribution Q, such that the samples from Q look as much as possible like the samples from P. In the example on the right-hand side, I've got samples from a GAN, in this case images of bedrooms generated artificially with the goal of matching as closely as possible the distribution P. Now, when I'm training a machine learning algorithm to generate samples, I need a loss function to train with, and in this case I'm expressing the loss at the bottom of the slide as a divergence between P and Q, which I'm going to compute from samples.
Now, obviously, when I'm learning to generate samples from a distribution, it's not that I want to generate one very, very nice image of a bedroom and only that one: I want to cover the support. I want to get pictures of bedrooms that are representative of all the different kinds of bedrooms I might be able to imagine. So this divergence should be a measure of similarity between one set of samples and another set of samples, and it could be my loss. But what we're going to see in this talk is that a divergence can be more than a loss. If I have a good divergence measure between one set of samples and another, I can even use it to improve the model itself: I can take my model and actually enhance it using the divergence, to make it even better than it was when I started. And this is the new material covered in this talk. So the divergence has more than the role of just being a loss; it has a role in improving the model itself when you're generating new samples.

Now, just as a reminder of a classical and very famous generative model: this is the generative adversarial network, and it's going to be the starting point of the talk today. It will also show us where the divergence is used in, let's say, classical generative modeling. The GAN, which I used to generate the samples on the slide, proceeds as follows. I have a generator, which I'm representing as a little student here, generating a set of candidate samples; these might be fake bedrooms, or in fact dots. Then I have a critic function, drawn here as a professor, which evaluates my student's samples and gives feedback. The message of this talk will be that I should keep my critic, I shouldn't throw it in the trash: I should use the critic to enhance my samples.

All right, so now the outline of the talk. First, I'm going to describe divergences as they're used in GANs. There are two broad classes of divergences I'll talk about: one is the integral probability metrics, and the other is the f-divergences. IPMs have been used relatively more recently in GANs; the f-divergences were the original divergences used in GANs. I'm going to argue that GANs aren't really using f-divergences, even if they intend to: what they're using is a notion of divergence which is closer to an IPM, regardless of what they start with. That will be, in some sense, a provocation, but I'll argue the case. Once I've covered the notion of divergences, I'm going to introduce generalized energy-based models, which are the topic of this year's paper. These are like a GAN; however, I don't throw my critic into the trash. My divergence doesn't go into the trash: I use it to enhance my model, and, all else being equal, I'm going to make the claim that using the critic as part of the sample generation process is always going to do better than throwing it away. It's got information in it that the generator doesn't have, and if we use it, we're going to get better samples. So those are the two parts of the talk.

Let's start by talking about divergences as used in GANs. If I'm comparing probabilities, at a high level there are two things I might want to do. One is that, in some sense, I could subtract one from the other, p minus q; and if p minus q, in a sense to be determined, is 0, then I would say that P and Q are the same. Another approach is to take, in some sense, the ratio of p to q; and if this ratio is one, then P and Q are the same.
So these are two fundamentally different ways of comparing probabilities: I subtract one from the other, or I take the ratio of one to the other. These two philosophical themes give rise to two different families of divergence measures. On the left-hand side we have the integral probability metrics. These are based on the notion of taking a difference of P and Q, and I need to somehow witness this difference. So I look for a function, from a class of well-behaved functions calligraphic H, that witnesses the difference between P and Q, and it does that by maximizing the difference in expectations under P and Q:

    D_H(P, Q) = sup_{f in H} E_P[f(X)] - E_Q[f(Y)].

If this difference is 0 for every function in my class of functions, then I'm convinced that P and Q are the same; and obviously, as it grows larger, P and Q become more and more different. In the category of f-divergences, I'm witnessing a ratio of densities p/q by taking the expectation under Q of some function phi of that ratio. Now, phi should return 0 when the ratio is one, because then P and Q are the same. And phi will have some other properties as well that are going to be useful to us: notably, we need phi to be convex, and it's also going to be lower semi-continuous, though that will be a behind-the-scenes property. So these are fundamentally different notions of divergence between P and Q.

Let's start on the IPM side, very briefly. I'm going to talk about two divergences between P and Q that are integral probability metrics, known otherwise as the Wasserstein and the MMD, and they differ by using different classes of functions. Let's start with the Wasserstein, a very famous notion of divergence. This uses, as its class of well-behaved functions, the functions whose Lipschitz norm is bounded by one. That means if I take any two points, the difference in the values of the function at those two points, divided by the norm of the difference of the points at which I evaluate, has to be bounded by one everywhere. So I can say, for instance, that the slope of my function can never exceed one. Here's an example on a simple one-dimensional problem, where I've got a set of blue points, my real points, and a set of red points, my fake points, and here is a candidate witness function for this example. We can see that the Wasserstein distance has a very nice property: as Q approaches P, W1 is going down, from 0.82 to 0.65. And this generalizes to a property which is much more powerful: if Q, the set of fake samples, is converging weakly towards P, the set of real samples, then we know that W1, the Wasserstein distance, will converge to 0. I express this as a helpful critic: it's telling me, as my red approaches my blue, that I'm getting closer to the target.

The MMD is another instance of an integral probability metric. In this case, the set of functions I use to get my witness functions (the functions that maximize the difference in expectation under P and Q) is the unit ball in a reproducing kernel Hilbert space. Here I'm using the reproducing kernel Hilbert space for the Gaussian kernel. Once again, as Q approaches P, as long as I use a large enough reproducing kernel Hilbert space, for example the RKHS with the Gaussian kernel, my MMD is going to decrease: as Q approaches P, my MMD shrinks down to 0. So once again, the MMD is a helpful critic: it's giving me useful feedback as my red approaches my blue.
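To make the MMD concrete, here is a minimal numerical sketch of the quantity just described. This is not the speaker's code, just an illustration with made-up Gaussian data, showing the MMD shrinking as the fake samples approach the real ones:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Pairwise Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    # Biased estimate of the squared MMD between samples X ~ P and Y ~ Q.
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean())

rng = np.random.default_rng(0)
P = rng.normal(0.0, 1.0, size=(500, 1))          # "blue" real samples
for shift in [2.0, 1.0, 0.5, 0.0]:               # Q approaches P
    Q = rng.normal(shift, 1.0, size=(500, 1))    # "red" fake samples
    print(shift, mmd2(P, Q))                     # MMD shrinks towards 0
```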
So both the Wasserstein and the MMD have been used as critic functions in GANs, successfully, which makes good sense, because each gives good feedback to my student that my fake samples are approaching my real samples when this is taking place. All right, but this talk is mainly going to be about the f-divergences and not the IPMs. I gave the IPM examples so that you could get an intuition for what a helpful critic looks like, and so that we can better understand what people mean in practice when they use f-divergences for GANs. It turns out that when you're using f-divergences for GANs, you're doing something closer to an IPM than to an f-divergence. So this is a foreshadowing of what we're going to get to in a moment.

Now, the f-divergences. They cover a very broad category of divergences, depending on the phi function that you use: the KL, the reverse KL, the chi-squared; there are many, many functions among the f-divergences. As a reminder, the f-divergence is the expectation under Q of a function phi of the ratio of p to q:

    D_phi(P, Q) = E_Q[ phi(p(X)/q(X)) ].

Clearly, for this to be defined, the ratio needs to be defined: p and q need to have densities, and the support of P should be within the support of Q for this to make any sense. Furthermore, phi(1) has to be 0, because when P and Q coincide we need D_phi to return 0; and phi is going to be convex, which will be useful to us later. Here's a famous example, the KL divergence. The KL divergence uses phi(u) = u log u, and you can see with a very small amount of elementary algebra that the KL is indeed an f-divergence with this phi function, because the ratio cancels against the q and switches the expectation over Q into an expectation over P.

Now, let's say that I'm training a GAN and I'm going to use the KL, or the reverse KL, or the Jensen-Shannon, or whatever my favorite f-divergence is. Is that a good critic function? Arjovsky and Bottou have argued, and I think correctly, that the f-divergences are, in almost all cases, a very bad idea if you're training a generative adversarial network. The reason is that if P and Q have disjoint support, then the f-divergences are not useful at all. Here's an example. Here are P and Q; in this case they both have densities on the real line, but the supports of P and Q don't overlap, so the ratio of p to q isn't defined. For this example, the KL is infinite; the Jensen-Shannon, which is like a symmetrized KL, is log 2. And this doesn't change even as I move my red around: as I move my red around, these stay constant. So this is not giving useful feedback to my generator at all. My generator is, in principle, improving, because the red is approaching the blue, but my critic is just yelling at me that I'm wrong. I'm showing that in cartoon form as this angry professor, who just yells at you that you're wrong, whatever you do, even as you get the answer perfectly right.
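As a toy numerical check of this pathology (my own construction, not from the talk), take two point masses on disjoint supports: the KL stays infinite and the Jensen-Shannon stays at log 2 no matter how close the supports get:

```python
import numpy as np

# P is a point mass at 0; Q is a point mass at theta. On disjoint supports,
# KL(P||Q) is infinite and JS(P,Q) = log 2, however small theta becomes.
def kl(p, q):
    mask = p > 0
    return np.inf if np.any(q[mask] == 0) else np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

grid = np.linspace(0, 10, 11)
p = (grid == 0).astype(float)            # all mass at x = 0
for theta in [5, 2, 1]:                  # "red" moving towards "blue"
    q = (grid == theta).astype(float)
    print(theta, kl(p, q), js(p, q))     # KL stays inf, JS stays log 2
```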
Arjovsky and Bottou argue that this situation of disjoint support will almost always be the case when you're in a high-dimensional domain and you're trying to learn a distribution with low-dimensional support in that domain. Generating images in high dimensions is a great example of that situation: images exist in pixel space, which is very high-dimensional, but they occupy only a very low-dimensional subspace of that high-dimensional space. And my generator almost certainly won't be able to match that support, except perhaps on a set of measure zero. So my KL is always going to be useless as my loss: it's going to give me no helpful feedback. So it seems that we've completely ruled out using f-divergences, and yet, you know, thousands of GAN papers out there seem to be claiming otherwise. What's actually going on here?

To understand what's meant when f-divergences are used in practice to train a GAN, I'm going to have to make a little detour and introduce the notion of the Fenchel dual, which gives an alternative form of the f-divergences. This alternative form is going to be the foundation of the divergences used in practice to train GANs. So what is the Fenchel dual? It turns out that the Fenchel dual of a function, of this phi function here, is something actually very simple: it's an equivalent representation of that function in terms of its slopes and intercepts. If I take the dual of a function and I give it a slope, I take the tangent line with that slope and I return the negative intercept of that tangent line. So it's the slope-intercept form of the function. And if my function phi is strictly convex, then the slope-intercept form is a unique representation of the function: whenever I give a slope, I get a unique intercept at that slope. So this is the slope-intercept form of phi: I give it the slope, I get the negative intercept of the tangent line at that slope.

Now, if I introduce this notion of the dual, the slope-intercept form of phi, it turns out that if I take the dual of the dual, I get my original function back again. If I've got a function which is convex (remember, a few slides ago, I required phi to be convex) and lower semi-continuous, then this property holds: take the dual of the dual and you get the original function back. So why is this interesting? It's interesting because I can now express my function phi as the solution of an optimization problem at every x at which I wish to compute it:

    phi(x) = sup_t [ t x - phi*(t) ].

So at every x where I want to compute phi, I solve an optimization problem in terms of the dual form, and I get back my original function. It seems that I'm just making more work for myself, but bear with me; we'll see where this gets us. Let's look at an example: the KL divergence. The KL divergence uses phi(x) = x log x, and I can express this function as the solution of an optimization problem where phi*(t) = e^(t-1). If I solve that optimization problem at every x, then I recover the function x log x.

Okay, that's the interlude done. Now let's see where we can use this dual form in finding an approximation to the KL divergence, or to any f-divergence. Here is an f-divergence, completely generic, off-the-shelf: the expectation under Q of phi of the ratio p/q. Now I'm going to replace phi by the solution of an optimization problem, as described on the previous slide. For every z, I want to evaluate phi at the ratio p(z)/q(z), and to do that, for every z, I need to solve this optimization problem. For every z, I get a different value f(z), the solution of the optimization problem, and if I solve it at every single z, I recover my original divergence. So at the moment I've done nothing: I've just taken my phi and expressed it as the solution of an optimization problem.
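Here is a small sketch of that biconjugation property for the KL's phi, with a grid search standing in for the inner optimization (the grid and the test points are arbitrary choices of mine):

```python
import numpy as np

# phi(x) = x log x has Fenchel conjugate phi*(t) = e^(t-1).
# Biconjugation: phi(x) = sup_t [ t*x - phi*(t) ], recovered by grid search.
phi      = lambda x: x * np.log(x)
phi_star = lambda t: np.exp(t - 1.0)

t_grid = np.linspace(-5, 5, 100001)
for x in [0.5, 1.0, 2.0, 3.0]:
    recovered = np.max(t_grid * x - phi_star(t_grid))   # inner problem
    print(x, phi(x), recovered)                         # the two agree
```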
What happens now if I constrain my f to be a smooth function of its argument z, not allowing it to change arbitrarily with z, so that in neighboring z my function has to change slowly? Then I get a lower bound on my original divergence, by restricting how f can change with z. If I do that, I obtain the following expression:

    D_phi(P, Q) >= sup_{f in H} E_P[f] - E_Q[phi*(f)].

All I've done here, by the way, to get this E_P[f] minus E_Q[phi*(f)], is take the q inside the expectation and cancel it against the ratio, to get the p here and the q there. So I've restricted my function class, and this results in a lower bound. The bound is tight when the f in my class of functions can equal the derivative of phi evaluated at p/q: if my class of smooth functions contains phi'(p/q), then the bound becomes tight and I recover my original divergence. I'm assuming this ratio is defined, of course; if it's not defined, then things get more complicated. Are there any questions about that? Okay, no questions so far, good.

So let's take the example of the KL divergence. The KL divergence is the expectation under P of the log ratio of p to q. I'm now going to use the dual form of my KL to get the following lower bound on the KL:

    KL(P, Q) >= sup_{f in H} -E_P[f] - E_Q[e^(-f)] + 1,

where H is now a class of smooth functions. I've reparameterized my set of functions: rather than using f, I use minus f plus an offset of 1. That's just to make the expression here consistent with an expression later in the talk; apart from that, this is just the dual form. So this is a lower bound on my original KL. When is the bound tight? The bound is tight when f equals the derivative of phi evaluated at the ratio of p to q, which in this parameterization works out to the negative log ratio, f = -log(p/q). In this figure, I give the example where p and q are both Gaussians with different means but the same variance. In that case, the negative log ratio is a straight line, so if my set of functions contains straight lines, then I can get a tight bound here.

What happens if, instead of the populations P and Q, I have samples? Well, then I can just replace my population expectations with sample expectations: the sample expectation under P of minus f, and the sample expectation under Q of e^(-f). And then I can approximate the solution of this lower bound. We're going to call this the KALE: the KL Approximate Lower-bound Estimator. This is much closer to what people are actually using: when they say they use a KL divergence for a GAN, they're actually using the KALE, an approximate lower bound on the actual KL.

So let's look at how this approximate lower bound behaves. We've seen already that the original KL divergence is very poorly behaved when P and Q have disjoint support: it just gives you infinity every time. But what about the lower bound? Here is my lower bound, -E_P[f] - E_Q[e^(-f)] + 1, where I'm using as my set of functions a reproducing kernel Hilbert space, so these are well-behaved functions, and I'm enforcing a roughness penalty, so I'm penalizing rough functions. Here's the example from before, where I've got my P and my Q with disjoint support, and the KALE, the approximate lower bound, is in this case not infinity, because I required my function to be a smooth function.
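As an illustration (my own toy version, not the paper's estimator, which uses an RKHS with a roughness penalty), here is the KALE estimated from samples of the two-Gaussian example above, where a linear witness class is rich enough to make the bound tight:

```python
import numpy as np

# KALE(P, Q) = sup_f  -E_P[f] - E_Q[exp(-f)] + 1.  For two 1-D Gaussians
# with equal variance, the optimal witness f* = -log(p/q) is a straight
# line, so a linear class f(x) = a*x + b suffices. Plain gradient ascent.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, 2000)     # samples from P
Y = rng.normal(1.0, 1.0, 2000)     # samples from Q

a, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    w = np.exp(-(a * Y + b))                 # e^{-f(Y)}
    grad_a = -X.mean() + (w * Y).mean()      # d/da of the objective
    grad_b = -1.0 + w.mean()                 # d/db of the objective
    a, b = a + lr * grad_a, b + lr * grad_b

kale = -(a * X + b).mean() - np.exp(-(a * Y + b)).mean() + 1
print(kale, 0.5)   # true KL between N(0,1) and N(1,1) is 0.5
```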
Because it's a smooth function, even though the ratio of p to q is not defined (P and Q have disjoint support), this function, the KALE, is nonetheless well-defined. Moreover, as the red approaches the blue, the KALE drops towards zero. This is a constructive, helpful critic function: as my red is approaching my blue, my KALE, my variational lower bound, is shrinking nicely to 0 and giving helpful feedback to my generator. So if I use one of these lower bounds to train my GAN, I can expect it to work. The KL itself is vacuous here; it's infinity, it doesn't tell me anything. But even though I'm working with a lower bound, my lower bound is actually informative: it's telling me something about where the red and the blue differ and how dissimilar they are, whereas the original divergence just told me "P and Q are different", and that's all it's going to say. So in this case, graphically, we can see that the KALE, the lower bound, is actually behaving a lot more like an integral probability metric, because I've got this witness function which tells me where red and blue disagree, even when the ratio of red to blue isn't defined. In a sense, these lower bounds are much more like IPMs and much less like f-divergences. Are there any questions about that?

I'll just squeeze in a question here; so far there's no question from the audience yet, but I'm curious. The KALE is basically a KL where you smooth and lower-bound the witness. You mentioned there's the f-divergence type and also the IPM type. Compared to other IPMs, what is the advantage of considering the KALE instead of, say, the Wasserstein, optimal-transport-type divergences, or the MMD?

Yeah, that's a very good question, and it's a question which I don't have an entirely convincing answer for. Let me put it this way: if the goal is to train a GAN, then it's not obvious to me why the KALE should necessarily be better than the MMD, or the Wasserstein, or an IPM-based divergence. And to the extent that GAN training is possible with an f-divergence, it's possible because the approximation you're using actually acts a lot like an IPM and not very much like an f-divergence; if it were acting like an f-divergence, it wouldn't be working. However, if the goal is to train a generalized energy-based model, which is the second half of the talk, then the KALE form actually turns out to be very useful, because it has a relation to the likelihood which the IPMs do not. So if the goal is to train a generalized energy-based model, the KALE is the way to go. I'm giving a preview of the second half of the talk there.

One little warning, though, when you're using the KALE. For a generalized energy-based model you do need the KALE; you can't use an IPM there, because an IPM is not going to work for that purpose. However, the KALE does have a pathology, and so do all of these variational approximations to f-divergences, whether you're using a KALE, or a Jensen-Shannon bound, or for that matter a chi-squared bound, whatever you like. They all have the following pathology: they're not very informative about mode collapse. What do I mean by that? Let's look at a very simple example.
On this plot I've got the KL for a pair of Gaussians with the same mean but different variances. The red Gaussian is my fake samples, the blue my real samples, and as I move along the x-axis, my fake samples become more and more concentrated at the mode of the Gaussian: the variance of my red collapses. Clearly, as this happens, the KL is increasing. Now, the KALE, the variational lower bound, has a restriction on the log ratio of p to q: it requires my estimate of that ratio to be smooth. And the issue is that, as the variance of my red collapses relative to the blue, that log ratio becomes very non-smooth. So my KALE, my bound, is not going to be able to track my original KL, and that's what's happening in the green plot here; this is using the RKHS as my class of smooth functions. This is a problem, because I'm insufficiently penalizing the case where I've got mode collapse. So a challenge with these variational approximations to f-divergences is that they're insufficiently sensitive to mode collapse: they don't penalize mode collapse as much as they arguably should, compared with the original f-divergence.

Okay. So, given that I'm using an approximation, a lower bound on an f-divergence, it's worth understanding in a slightly more formal way which properties the bound retains. One property might be: can we guarantee that the lower bound on the KL still has the property that it's non-negative, and 0 if and only if P equals Q? This is a property of the KL; is it still true for a lower bound on the KL? A trivial lower bound would just be 0 everywhere; it's a lower bound, but not a very useful one. To ensure this property, you need to require that your class of functions is sufficiently rich, just as you had to do for IPMs. For example, it holds if H is dense in the space of continuous functions with respect to the infinity norm. This is true if, for instance, you're using a universal reproducing kernel Hilbert space. It's also true if you're using a neural network class built from affine transforms of your data, subject to a few constraints on the parameters of those affine transforms. So this ensures we have a rich enough class that the divergence is 0 only when P and Q agree.

The second property is the one I think is most essential when you're training GANs: as Q converges weakly to P, can we be sure that the KALE goes to 0? Remember, we want the nice property that as my red approaches my blue, my KALE should go to 0 in a nice way; it should be a constructive critic that tells me Q is approaching P. In this sense, the lower bound is acting more like an integral probability metric and less like an f-divergence, and this is the crucial property of these variational lower bounds: they are very much more like IPMs than they are f-divergences. And here, I think, is the key to seeing that property. Here is the KALE, which is minus the integral of f dP, minus the integral of e^(-f) dQ, plus one. The first thing I'm going to do is a vacuous operation, which is to add and subtract the mean of f under Q: plus the integral of f dQ, minus the integral of f dQ. Then I notice, because of the properties of the exponential function, that the grouped term e^(-f) + f - 1 is always greater than or equal to 0. So that means I can drop it and take an upper bound which is an integral probability metric: the integral of f dQ minus the integral of f dP. I've upper bounded my KALE by an IPM. And if my calligraphic H consists of Lipschitz functions, I can in turn upper bound that by the Wasserstein distance. We know that if Q converges weakly to P, the Wasserstein goes to 0, and that means my KALE goes to 0.
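Written out, the chain of bounds just described is the following (reconstructed from the definitions above):

```latex
\begin{aligned}
\mathrm{KALE}(P,Q)
 &= \sup_{f \in \mathcal{H}} \Big[ -\int f \, dP - \int e^{-f} \, dQ + 1 \Big] \\
 &= \sup_{f \in \mathcal{H}} \Big[ \int f \, dQ - \int f \, dP
      - \int \big( e^{-f} + f - 1 \big) \, dQ \Big] \\
 &\le \sup_{f \in \mathcal{H}} \Big[ \int f \, dQ - \int f \, dP \Big]
      \quad \text{(drop the last term, since } e^{-f} + f - 1 \ge 0\text{)} \\
 &\le L \, W_1(P, Q)
      \quad \text{(if every } f \in \mathcal{H} \text{ is } L\text{-Lipschitz)}.
\end{aligned}
```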
So when I say that my KALE acts like an IPM, what I mean is that it's upper bounded by a very nice IPM, the Wasserstein-1 distance: as long as that goes to 0, I'm sandwiching my KALE down to 0. This is a really attractive property of these variational lower bounds. Okay, that wraps up the discussion on divergences. Are there any questions before I talk about generalized energy-based models? Questions on that page? This property, by the way, is going to be absolutely key to generalized energy-based models; it's the essential property we're going to be using.

Okay, so let's now talk about generalized energy-based models. This is the new work that appeared this year. Let's remind ourselves again: we've got our generator generating samples, we have a critic which evaluates the samples and gives feedback to the generator, and this iterates until an equilibrium is reached. Normally, once the equilibrium is reached, we take the critic function, we throw it in the trash, and we just use our generator. Let's first remind ourselves what this generator is doing. The generator in a GAN, like the one generating the images we saw, does the following: it takes a very low-dimensional source of noise, pushes it through a bunch of filters, and outputs a very high-dimensional object. But of course, the number of degrees of freedom is determined by the low-dimensional source of noise that it's pushing through. So even though the output is in a high-dimensional space, ultimately it covers a very low-dimensional subspace of that very high-dimensional space. And this is an important property; this is why GANs work. Because images live on a very low-dimensional subspace of a very high-dimensional pixel space, this property is essential to get nice sharp images. Obviously, learning densities in a very high-dimensional space is a nonstarter; that's never going to work. So this is the concept one needs to keep in mind when understanding the generator in a GAN.

Now let's understand at a high level what a generalized energy-based model does. I'm going to give you just a concept. Let's say I have the task of matching a distribution in a relatively high-dimensional space, supported on a low-dimensional submanifold of that space. To give something I can plot on the slide, the high-dimensional space is two-dimensional, but the manifold is one-dimensional; I can visualize this very easily, whereas I couldn't if my target space were 10,000-dimensional, as it would be for images. So: a two-dimensional ambient space, a one-dimensional submanifold in that space. And the target distribution has the property that it puts a bit more mass towards one end, then less mass in the middle, then more mass towards the other end again. That's my target. Now let's say that I've trained a GAN. The GAN, in this case, takes a low-dimensional input z, here a one-dimensional input, and pushes it through a transform.
In this case, the transform is the correct one to get an x in the two-dimensional ambient space; however, it covers only a 1D submanifold in that 2D space. This is what my generator has done. What did it get wrong? Well, the issue is that my eta distribution here, the noise, is just a uniform distribution, and when I pushed it through my nonlinear transform, it didn't end up putting more mass towards the ends and less mass in the middle. It got the support correct (by design, I allowed it to do that), but it got the mass wrong. So the question is: if that's the best my generator can do, how might I correct the generator to make it give the right answer? The answer, plainly, is importance weighting. Let's say I've got my generator, and it's doing this. What I would like to do is importance weight my generator, to increase the amount of mass it puts at the ends and decrease the amount of mass it puts in the middle. If I could do that with some energy function, then I would take my generator and correct it to better match the target. In this case, importance weighting is a meaningful thing to say, because my generator's support and my target's support match, so I can define the ratio of the mass on that support in a meaningful way.

Now, what's interesting here is this contour. I should read that contour along the support of my generator, along these red dots. But it might surprise you to see that the contour function is defined over the whole space. That's a little bit weird, right? Because surely I should only care about the value of the contour along the red dots. So what's going on? What's going on is that, in practice, the generator is not going to be able to perfectly match the support of the target. It's going to get two things wrong: the support, and the mass. Here's an example of that: my red is trying to match the support of my blue; it's got close, but it hasn't quite hit the target. The ratio of the densities p/q is not defined; they only overlap on sets of measure zero, where they cross over here, here, and here. Nonetheless, I want some notion of importance weighting on my generator so that it hits my target, and this is why the contour is defined over the whole space. In this case, where my generator is close to but not quite at the target, I can still use this energy function, defined over the whole ambient space, to push the mass up at the ends and push the mass down in the middle. So if my energy function is more than just the ratio of densities p/q, if it's defined in a broader, more general sense, then I can still use it to improve my generator. This is the notion of generalized energy-based models: an energy function is defined in a meaningful way when P and Q have the same support; a generalized energy function is defined even if you have mismatched supports, and I can use it to importance weight my red. Are there questions about that?
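Here is a toy numerical version of the picture on this slide. The curve, the energy, and all constants are invented for illustration, and, as discussed just below, one would not actually resample like this in practice:

```python
import numpy as np

# A generator that gets the 1-D support right inside 2-D space but gets the
# mass wrong (uniform along the curve), corrected by reweighting with an
# energy defined on the whole ambient space.
rng = np.random.default_rng(0)
G = lambda z: np.stack([np.cos(np.pi * z), np.sin(np.pi * z)], axis=-1)

z = rng.uniform(0.0, 1.0, 5000)            # uniform base noise: mass is wrong
x = G(z)                                   # 2-D points on a 1-D arc

# An energy that is low (weight e^{-E} high) near the ends of the arc; it is
# a function of the ambient x, so it stays defined off the curve as well.
E = lambda x: 4.0 * x[:, 1]
w = np.exp(-E(x))
w /= w.sum()                               # self-normalized importance weights

resampled = z[rng.choice(len(z), size=5000, p=w)]
print(np.histogram(z, bins=5, range=(0, 1))[0])          # roughly flat
print(np.histogram(resampled, bins=5, range=(0, 1))[0])  # piled up at the ends
```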
Just a quick question here, because it's interesting: is there a connection to importance sampling?

Very much so, absolutely. In fact, you would hope that when P and Q have matching support, this would reduce to the importance sampling setting, which it does. However, when P and Q have mismatched support, you're doing something different, and this is what's new about this work.

A follow-up, then. In Monte Carlo, for example, people develop sampling methods where you have a proposal distribution and you want to convert it into the target distribution, and here you have this generator. The classic method is accept-reject: if the proposal distribution is close to your target, then you should get better performance, something like that. So does a similar thing arise here? You mentioned that a uniform distribution may not be a good proposal distribution, or something like that.

That's a very good question. In particular, even though I'm giving the intuition in terms of importance weights, in practice we never want to importance weight the samples; it's a terrible idea, basically because of what you're saying: if you don't have a good proposal distribution, you are not going to do well. What we're going to do instead, which is, I think, a much better behaved approach, is to modify the noise going into the generator in such a way that it has the equivalent effect of this reweighting. We're going to use Hamiltonian Monte Carlo in the latent space of the generator, and that has the equivalent effect of the reweighting; it's a much, much better behaved approach than doing accept-reject. I've given away the surprise there. Good, very good questions.

So let's see now in mathematics what we've just seen in figures. We define a model using two components, a generator G_theta and an energy function E. This is the generalized energy-based model. The generator looks exactly like a GAN generator: I take a low-dimensional source of noise eta, I draw a sample z from it, and I push z through a set of filters parameterized by theta. This outputs an x in a high-dimensional space, covering a low-dimensional subspace of that high-dimensional space; my x is drawn from Q_theta. Next, I define an energy function E over the whole ambient space, and I use that energy function to reweight the samples that I've drawn. I then need a normalizer to make sure my model is properly normalized, which is the integral of the exponentiated negative energy over Q_theta; and this is my model:

    dP_{theta,E}(x) = e^(-E(x)) dQ_theta(x) / Z,    Z = integral of e^(-E) dQ_theta.

Now, if Q were to have a density with respect to the Lebesgue measure on x, that is, a density on the whole of x rather than on a low-dimensional subspace, then I would be back in the standard energy-based model setting, and there are many, many papers on that setting. However, we're interested in the case where Q doesn't have a density with respect to the Lebesgue measure, which is the setting relevant to GANs. Now, getting back to the point I mentioned: I never want to actually do this importance weighting in practice; I don't want to reweight my samples or do accept-reject. Instead, as we're going to see in a few slides, I'm going to get a posterior in my latent space over z, which I'll be able to sample from using HMC, and that will have the equivalent effect of doing this importance weighting. Rather than importance weighting my outputs, I'm going to sample from parts of my z-space so that the resulting high-dimensional samples x are distributed according to my model.
So I emphasize different parts of my z-space: I sample more from the parts of z-space where the energy gives my model high weight, and less from the parts where it gives low weight, and that's a much more efficient way of doing it. In particular, the convergence results that I get are going to depend on the dimension of my z-space, and not on the dimension of my x-space, which would obviously be very high. Mixing would be very poor, or rejections would be very frequent, in the x-space; in the z-space it's much better behaved.

Okay. This model sounds great, but how do I train my energy? And what on earth does this energy have to do with the critic functions I just spent half an hour talking about? I haven't explained any of that yet, so let's see the answers to those questions. How do I learn an energy? One way to fit the energy function, the obvious way, would be to maximize the log-likelihood of my model under P, my target. When I express my model using the form from the previous slide, this gives me minus the expectation of the energy under P, minus the log normalizer:

    L(E) = -E_P[E] - log Z.

Now, assume first of all that the KL is well-defined. If the KL is well-defined, that means the ratio of p to q is well-defined, so P and Q have densities and the ratio exists. Then this quantity is a lower bound on the KL, the Donsker-Varadhan bound, which becomes tight when the energy equals the negative log ratio of p to q. So if you remember, from a little earlier when we talked about the KALE, we had exactly the negative log ratio of p to q appearing, and that should start giving you flashbacks. This estimation of the density ratio is very well studied: Sugiyama has written papers on it, Kanamori has written papers on it, among many others; it's a very, very well studied quantity. However, in this paper we are interested in the case where P and Q don't have a density ratio: P and Q have disjoint supports, they both live on low-dimensional subspaces of very high-dimensional spaces, and the ratio is not well-defined. I might still want to pick E by maximizing this quantity even in that setting. Does this even make sense? As we're going to see, it's going to make sense as long as we make some smoothness requirement on our energy function, and the precise smoothness requirement will be established shortly. So this is the generalized log-likelihood, and it's a pretty sensible quantity to optimize, because I'm optimizing the expectation of the log of my model under my data.

But this log normalizer is a bit of a pain; we don't want to have to deal with it. So what do we do? Here's the previous slide: the expectation of the log model, which is minus the expectation of the energy under P, minus the log of the integral of e to the minus E, dQ, the normalizer. The second quantity is the pain, so what do I do? Here I use one last trick, based on the convexity of the exponential, which is that I can lower bound this negative log partition function: for a scalar c, I have minus c, minus e to the minus c times the integral of e to the minus E dQ, plus 1. This bound is tight when the scalar c equals the log normalizer; as I train my model and learn my c, c is going to approach the log normalizer. I've made c another quantity that I optimize over.
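Written out, the scalar bound being used here is the following (a standard consequence of the convexity inequality e^u >= 1 + u, reconstructed from the slide):

```latex
% Take u = \log Z - c in e^{u} \ge 1 + u, with Z = \int e^{-E}\, dQ_\theta:
-\log Z \;\ge\; -c \;-\; e^{-c} \int e^{-E} \, dQ_\theta \;+\; 1,
\qquad \text{with equality at } c = \log Z .
```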
Using that last trick, I get the following quantity, my bound on the generalized log-likelihood: the expectation under P of minus (E + c), minus the expectation under Q of e to the minus (E + c), plus one. And this is the KALE. The critic objective from my KALE GAN is a lower bound on the generalized log-likelihood, which becomes tight as I learn my c, because the offset c approaches the log normalizer. So if I have a GAN which uses the KALE as its critic, then the critic function that I get is the energy function of a generalized energy-based model. Having trained my GAN, I can take my critic and use it as part of a generative model to get better samples. This is the most important slide of the talk: my critic function gives me an energy, I use the energy to importance weight my generator, and I get better samples out. Any questions about this?

So what happens to the c once training is done?

I don't have any further purpose for it. It might be interesting, for instance, if you wanted a normalized model and you wanted the normalizing constant; then it might be worth keeping it. As far as getting samples from a generalized energy-based model which incorporates the critic is concerned, we don't care: all we care about is the model up to the normalizing constant. Obviously, in training, we have the c separated out for the purposes of getting this match between our generalized energy-based model and our critic.

Okay, great. So, basically: I train my GAN, I use the KALE critic, I take the energy function, my critic, and I use that in my model as a reweighting function. I won't talk in detail about training the generator; the main thing is that you train your generator more or less as you would for a GAN, and since I'm coming towards the end of the talk, I won't dwell on it. In essence, if I train my energy function to convergence, then I get a valid gradient step for my generator: train the energy to convergence, then take a step in the generator parameters, and that's a valid step for the generator. In practice, of course, we don't want to train the critic all the way to convergence between generator steps; we just train it for more steps, a lot of energy-function steps and then one generator step, and that satisfies this property well enough in practice. But that's a detail.
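To fix ideas, here is a minimal training sketch in this alternating style. This is my own toy 1-D setup, not the paper's code; the network sizes, learning rates, and step counts are illustrative assumptions, and the smoothness penalty on the energy is omitted for brevity:

```python
import torch
import torch.nn as nn

real = lambda n: torch.randn(n, 1) * 0.5 + 2.0          # stand-in data from P

G = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))  # generator
E = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))  # energy
c = torch.zeros(1, requires_grad=True)                  # log-normalizer offset
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_e = torch.optim.Adam(list(E.parameters()) + [c], lr=1e-3)

def kale(x_real, x_fake):
    # KALE objective:  -E_P[E + c] - E_Q[exp(-(E + c))] + 1
    f_p, f_q = E(x_real) + c, E(x_fake) + c
    return -f_p.mean() - torch.exp(-f_q).mean() + 1.0

for step in range(2000):
    for _ in range(5):                                  # many energy steps...
        loss_e = -kale(real(256), G(torch.randn(256, 2)).detach())
        opt_e.zero_grad(); loss_e.backward(); opt_e.step()
    loss_g = kale(real(256), G(torch.randn(256, 2)))    # ...one generator step
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```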
This next slide, though, I do want to spend a bit of time on, because it is, I think, the essential slide. Importance weighting is a terrible idea in practice, so we want a method to sample from our model which doesn't require us to do rejection sampling, importance weighting, anything like that. The way we do that is by modifying the latent distribution of our generator, in such a way that we draw more samples from the z whose outputs the energy upweights, and fewer from the z whose outputs it downweights. Let's see how we do that. This is our earlier figure: I take a sample from eta, I pass it through the generator, and I reweight it with this energy function. Now let's consider the expectation of a test function under this end-to-end model: I want the expectation of a function g of X under the model. That would be

    E[g(X)] = (1/Z) * integral of g(G_theta(z)) e^(-E(G_theta(z))) eta(z) dz.

Looking at this expression, we can see the posterior over the latent space: it's eta(z) times e^(-E(G_theta(z))), normalized. If I sample from that, it's equivalent to finding the parts of my z-space that carry high weight in my x-space. So I just modify my z: I tilt my eta(z) distribution by this quantity, and then I do HMC on this distribution. I run Hamiltonian Monte Carlo on it, and this amounts to sampling from the parts of my z-space that carry high weight in my x-space. This is very important: I'm doing HMC on a low-dimensional space, not in a high-dimensional space, so it's very efficient, it mixes well, and it behaves sensibly. This is extremely important to making the method work in practice, and if you download our code, this is what we're doing. To generate new samples, we draw from this posterior distribution and then we just pass the draws through the generator. That's it.
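And here is a sketch of that latent-space sampling. Again, this is my own simplified version, using unadjusted Langevin dynamics in place of full HMC for brevity, reusing the G and E networks from the training sketch above; the step size and chain length are arbitrary:

```python
import torch

# Target density over z is eta(z) * exp(-E(G(z))), sampled with Langevin
# dynamics in the low-dimensional latent space.
def log_posterior(z):
    return (-0.5 * (z ** 2).sum(dim=1)        # log eta(z), standard normal
            - E(G(z)).squeeze(1))             # minus the energy of G(z)

z = torch.randn(64, 2, requires_grad=True)    # 64 chains in the latent space
eps = 1e-2                                    # Langevin step size (assumed)
for _ in range(500):
    logp = log_posterior(z).sum()
    grad, = torch.autograd.grad(logp, z)
    with torch.no_grad():
        z = z + 0.5 * eps * grad + eps ** 0.5 * torch.randn_like(z)
    z.requires_grad_(True)

samples = G(z).detach()                       # push improved latents through G
```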
Since this is a machine-learning talk, I need some pictures. This is a set of samples from a generalized energy-based model trained on CIFAR-10, where we run the sampler from a random starting point at a low temperature, and you can see that the samples converge towards modes: different boats, here's a horse, here's a car. If I run my sampler, it ends up sampling modes of the distribution which correspond to images. That's nice. I made a claim at the start of the talk that using the energy function, all else being equal, always improves the quality of the samples over using just the generator alone, and this slide illustrates that point. It shows the case where we've trained a GAN without using the energy function, and then included the energy function in the sampling, and we always get better FID scores. FID is a measure of the quality of the samples, so lower is better. The baseline is normalized to 100, and these red bars, which are uniformly lower for all of these different image datasets, are the quality of the samples we get when we incorporate the critic: including the energy function from the critic always improves our samples over not including it. You can see in all these examples it always does better, as it should. So it's working. Here is a fun thing as well: since you're running HMC, you can run the sampler for longer and it goes from mode to mode, which is pretty interesting. You're sampling over the posterior of the images that your model can generate: it homes in on one mode, then moves to another mode, then another, and so on.

So, I think this concludes the talk. We've proposed these generalized energy-based models. The key to the model is that it incorporates the critic as part of the generation process. Importantly, the sampling we use to do this is done in the latent space of the generator, which is a low-dimensional space, so we can do it efficiently. And we've shown that including the critic information always improves over the generator samples alone. We've got some code if you want to try it out. I'd also like to point out some other interesting papers. This paper here, "Your GAN is secretly an energy-based model": this was really nice work where they were actually using the Jensen-Shannon rather than the KALE as the way of getting the energy function, but apart from that, very much the same idea, that you use HMC, or some Langevin diffusion, in your latent space as a way of efficiently sampling from the model with the energy function. These other papers I cite as well are also using sampling in the latent space to incorporate the energy function from the critic. So there are quite a few approaches using this idea, and I think it's a very nice approach. And that concludes the talk; I'm happy to take any questions.

Great, thank you very much for a very interesting and clear talk. Now let's see if there are any questions from the audience. Okay, so maybe I have a quick question here first. I think this works really well on images. Does this suggest it can be a pretty efficient generator for high-dimensional data? What would you say about the complexity and scalability to high-dimensional data?

Yeah, I think that's a very good question. In fact, I would say that the key benefit of this model is that density estimation in high dimensions is always going to be hard, if not impossible; there are very depressing theoretical results about how quickly one can do density estimation in high dimensions. But this is, I think, a very promising way of learning models for distributions which have low-dimensional support in high dimensions. I think where some exciting work can be done is to establish theory for these generalized energy-based models: to see whether it's possible to show that they converge according to the dimension of the low-dimensional manifold, rather than the dimension of the high-dimensional ambient space. That would be very exciting. My conjecture is yes, and it appears in experiments that the answer is yes, but obviously the theory needs to catch up to the practice. The important point is that we need to start from the viewpoint that our model is not going to be able to get the support of the data exactly right; the overlap of the data support and the model support is going to be measure zero. But nonetheless, we hope to be able to show that in some metric, in some integral probability metric, the target and the model are going to be close, and that they converge fast, at a rate governed by the dimension of the low-dimensional support of the target. That's the hope, and I emphasize: we shouldn't require that the supports match. If we required the supports to match, I think we'd be asking too much, because I don't think we can achieve that.

Yes, that's great. Okay, I guess we're probably running out of time, but let's take one more question from the audience. One question from the chat: do we need equilibrium for the Langevin diffusion for such problems? What about non-equilibrium dynamics?

Yeah, in fact, I would almost recommend non-equilibrium dynamics. We can run our Langevin sampler so that it mixes well, but it ultimately depends on your goal. If your goal is to explore all the modes of the posterior, then absolutely you should care that your method is mixing.
If you want to converge on the modes, then you should do something a little closer to annealing, where you actually drop your temperature really hard as you approach a mode and get a really nice image out, then maybe increase your temperature again and drop it to get another mode, and so on. So in practice, when you're obtaining samples, if your goal is to get sharp samples, then you really should not be running with good mixing; you should be running with bad mixing. Very good question.

Okay, I guess we're approaching the end of the seminar, so let's give a round of applause to Professor Gretton: very interesting, very illuminating. Thanks a lot for making the time. If there are more questions, I think people can reach you by email.

Yes, my pleasure, and I'd be happy to answer questions by e-mail as well. Thank you very much for inviting me; it was a pleasure.

Thank you so much. Okay, bye.