Hello everyone, good afternoon. Welcome to today's machine learning seminar by Dr. Francis Bach, who is going to talk to us about structured prediction. He is a researcher at Inria and since 2011 has been leading the machine learning group in the computer science department at ENS. He received his PhD in 2005 from UC Berkeley. His research interests are primarily in machine learning, especially sparse methods, kernel-based learning, large-scale optimization, computer vision, and signal processing, and he has obtained numerous awards. In addition to Starting and Consolidator grants from the European Research Council, he won an Inria research prize, two ICML test-of-time awards, the Lagrange Prize in continuous optimization, and the Moreau Prize in 2019. More recently, last year, he was elected to the French Academy of Sciences. He is also active in the community: he served as ICML program co-chair in 2015 and general chair in 2018, and he is now co-editor-in-chief of JMLR. Without any further delay, Francis, the floor is yours.

Thank you for the invitation and for the nice introduction. Today I will talk about joint work with Alex Nowak-Vila, who graduated yesterday, and Alessandro Rudi, a colleague of mine at Inria. I am going to talk about structured prediction, and I will move quickly through the general context. I assume the audience is familiar with the recent progress in AI on most perception tasks like vision or text, for example translation; you can see some nice results here. Those results are obtained, and I think this is classical by now, by a combination of massive data to learn from, computing power, and of course new models which are adapted to the various modalities. So at the end, "AI", and I put this in quotes because I think this is the only time I will mention AI, is a combination of models, algorithms, data, and computing power, and what I am going to talk about today is the algorithmic part. This is my main focus: machine learning, and how we can turn these learning problems into good predictions, keeping in mind that this is only part of the story. If you want to do nice things like what you see here or there, you also need a lot of domain knowledge from experts. So much for the general introduction.

The goal will be to set this up as machine learning. In terms of notation, we have observations (x_i, y_i), for example an image x_i and a label y_i for that image, and the goal is to predict y given x. This could be done with some prediction function which is real-valued for this part of the talk, but we will quickly see that we may need more than a single value. There are several ways of parameterizing this prediction function. The easiest is linear functions, typically linear in the parameters: you have a feature vector phi(x) and you linearly combine its components. Sometimes that feature vector is given to you, like in advertising where it is a big sparse vector, but often you build it by hand. In many other situations, if you want to be nonlinear in your input x, you can either use a huge phi(x) or you can use kernels. This talk is not really about how to represent functions efficiently, which is a different issue; you can use whatever you want.
But sometimes you want to use neural networks, which are adapted to problems such as computer vision or NLP, where you cannot rely only on a model that is linear in the parameters: you need a sequence of linear models plus activations and so on. That is the classical neural net. Given either a neural net or a linear model, the way we are going to do machine learning in this talk is regularized empirical risk minimization: you minimize, with respect to the parameters of your prediction function, a data-fitting term involving a loss, plus a regularizer. This is very standard in machine learning, and the way we proceed is to minimize this quantity on the data. In this talk I will not talk about optimization at all. Just as a reminder, this is only a means to an end: the goal is to minimize the test error, that is, the error on unseen data. Typically you have to assume a link between the unseen data and your training data, and typically that link is made through a distribution which we assume to be the same at training and test time. I know this is not always true, but this is still the standard setting, and there are many interesting questions around that. So I could be giving a talk on how to optimize this problem, for example when you have a neural net, which poses challenges; I could also talk about how you can learn efficiently with a neural net, and so on. But today I am going to focus only on the loss part: which losses are used, and we will quickly see that this requires specifying what h is and what space h belongs to. That is the main topic of the talk: what should a good loss be, and what is a good way to parameterize h, in problems more complex than binary classification.

Before doing that, let me spend some time reminding everybody of what we know about classification. Let us start with regression: if you have real-valued outputs, you typically use the square loss, for many reasons; for example the gradient is linear, so it is easier to optimize. This is very classical, but what I want to focus on is classification. For classification we consider labels in {-1, 1}. Because it is difficult to parameterize functions whose values are in {-1, 1}, people typically learn a real-valued function h and then take its sign: the sign is minus one if the value is negative and plus one if it is positive, which tells you which label you predict. So you go from real numbers to {-1, 1}. That is the classical way to go from a real-valued function to {-1, 1}, and now you have to define losses. What has emerged in the last twenty years or so is to look at the product of y and h(x): y is in {-1, 1}, h is real-valued, and you use the sign of h to predict. If y and h have the same sign, you make no mistake; if y and h have different signs, you make a mistake. So the 0-1 loss that you report in papers is a function of the product of y and the prediction function: it is the blue curve, a step function which is not continuous and typically very hard to optimize, so people have come up with other functions.
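To make these curves concrete before we look at them one by one, here is a minimal sketch (my own, not code from the talk) of the standard binary losses, all written as functions of the margin m = y * h(x):

```python
import numpy as np

def zero_one(m):
    # the true 0-1 loss as a function of the margin: 1 if the signs disagree
    return (m <= 0).astype(float)

def square(m):
    # (1 - m)^2, which equals (y - h)^2 when y is in {-1, +1}: classification as regression
    return (1.0 - m) ** 2

def hinge(m):
    # max(0, 1 - m): the SVM loss
    return np.maximum(0.0, 1.0 - m)

def logistic(m):
    # log(1 + exp(-m)): the binary logistic / cross-entropy loss
    return np.log1p(np.exp(-m))

margins = np.linspace(-2.0, 3.0, 11)
for loss in (zero_one, square, hinge, logistic):
    print(loss.__name__, np.round(loss(margins), 2))
```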
The first one is the square loss, the red curve. It essentially says that we treat classification as a regression problem. This works reasonably well in practice, but there is a small part here which is a bit annoying: you penalize good predictions, those whose margin is large and positive. So people now use functions like the green line, which is the logistic loss, and I will present its justification in a moment, or the hinge loss. I am recalling all of this because many people just say "logistic" and forget why logistic. My goal is to remind ourselves of the basics of classification, in particular where the hinge loss comes from and where the logistic loss comes from, because when we go to more complex problems we may need to revisit them.

First the SVM, which corresponds to the hinge loss. This is the classical support vector machine, a nice old formulation going back to Vapnik and colleagues. The idea is that if you have a linearly separable problem, you want to find the separating hyperplane, this line here, such that if you project the data onto the normal direction, you get the largest possible distance between the points and the hyperplane. This can be reframed in a very nice mathematical framework where you look for the vector theta such that, for every i, the product of y_i and the prediction x_i transpose theta is greater than one, and you minimize the squared norm of theta. If you started studying machine learning before 2005 or so, you have certainly seen this in class, and that margin is in fact exactly that distance, once you normalize things properly. This is the separable case. In the non-separable case, when you start to have a few data points which are misclassified, you introduce the so-called slack variables, and you penalize the sum of the slacks, each slack being nonnegative. If you do that, you quickly see that if you optimize over the slack variable, the constraint turns into the max between 0 and 1 minus y_i x_i transpose theta. So that is a reminder of where the hinge loss comes from: a geometric argument.

The logistic loss has a different, equally classical justification, where you use a conditional model of P(Y = 1 given x), which is a sigmoid of h(x); the sigmoid squashes the real line into (0, 1). Then you invoke the classical maximum likelihood principle: maximizing the likelihood is the same as minimizing the cross-entropy loss. If you look at the formula, you take, for the y_i equal to plus one, the log of the probability of plus one, and for the y_i equal to minus one, the log of the probability of minus one; if you plug in the sigmoid, you end up with the logistic loss. I will not do the computation, it is just two lines of calculation. As far as we are concerned, these correspond to those two curves. The two are interesting to compare because they look very similar, but they have different behaviors, and this is the point of the next slide.
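Going back to the slack-variable derivation of the hinge loss for a moment, here is a concrete illustration (a sketch of mine, on made-up toy data): for a fixed theta, the smallest admissible slack for point i is exactly max(0, 1 - y_i x_i^T theta), so the soft-margin SVM is just regularized hinge-loss minimization, which a few lines of subgradient descent can handle.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true + 0.1 * rng.normal(size=n))   # toy, roughly separable labels

lam = 0.1            # regularization strength
theta = np.zeros(d)
for t in range(1, 1001):
    margins = y * (X @ theta)
    active = margins < 1.0                    # points with strictly positive slack
    # subgradient of (1/n) * sum_i max(0, 1 - m_i) + (lam/2) * ||theta||^2
    grad = -(X[active] * y[active, None]).sum(axis=0) / n + lam * theta
    theta -= grad / (lam * t)                 # classical 1/(lam * t) step size

slack = np.maximum(0.0, 1.0 - y * (X @ theta))   # optimal slack variables for this theta
print("training error:", np.mean(np.sign(X @ theta) != y), "mean slack:", slack.mean())
```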
About fifteen years ago there was a very nice paper by Bartlett, Jordan and McAuliffe, and a paper by Zhang, relating the behavior of the minimizer of the logistic or hinge loss to the behavior of the minimizer of the 0-1 risk. This is the so-called surrogate risk: it is what you minimize, in expectation, when you minimize the logistic loss on data. You minimize this one, but what you care about is minimizing the test error, where you take the sign of g to obtain something with values minus one or one, which is how you actually predict. The hope is that if you minimize the surrogate risk over a very large space of functions, say a neural network or a kernel method, then with sufficiently many data points you end up with something close to the minimizer of the true risk.

There are two notions here. Minimizers of surrogate losses are said to be Fisher consistent if, in the limit of infinitely many data points, the resulting predictor, the sign of g, happens to be the optimal one, the one minimizing the test error. In other words, when you threshold the output of the surrogate problem, you get the optimal behavior. It turns out that for all the convex surrogates on the previous slide this is true; it is true for many losses. So that is a first good thing. But you want more: you want that if you find a predictor g which is close to optimal for the surrogate, since the goal of optimization is to find a g which is close to optimal in what you actually minimize, then it is also close to optimal for the risk you care about. If the surrogate excess risk is small, you hope the excess risk is small as well. The excess risk is the deviation between what you achieve and the best you could do, the Bayes risk. If you have such a comparison inequality, then a small surrogate excess risk implies a small excess risk, and this is the classical justification for using binary losses like the logistic and the hinge.

Now, and it is a nice exercise, for the hinge loss the transfer function, the way you go from surrogate excess risk to excess risk, is just the identity: if the surrogate excess risk goes down as one over the square root of n, so does the excess risk, and there is no loss in the transfer. Whereas if you take the logistic or the square loss, you have a smooth function, so it is easier to optimize, but when you transfer from the surrogate risk to the risk you lose a square root, and a square root makes things bigger when its argument is smaller than one, so you get a worse rate. This is the classical trade-off between smoothness and the transfer function. The goal will be to see what we can say when we go beyond binary classification.
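In symbols, the comparison inequalities just described look as follows (a sketch from memory of the Bartlett, Jordan and McAuliffe / Zhang results; the exact constants should be checked in the papers):

\[
R(\operatorname{sign}\circ g) - R^{*} \;\le\; R_{\mathrm{hinge}}(g) - R_{\mathrm{hinge}}^{*},
\qquad
R(\operatorname{sign}\circ g) - R^{*} \;\le\; C\,\sqrt{\,R_{S}(g) - R_{S}^{*}\,},
\]

where the first bound holds for the hinge loss and the second for smooth surrogates such as the square loss (with C = 1) and the logistic loss (with C on the order of the square root of two); the square root on the right is exactly the loss in the transfer mentioned above.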
Before I go beyond binary classification, maybe I can take some questions on this part of the talk; it is mostly a review, but if you have questions, feel free to ask now, otherwise I can go on. There are no questions so far. Okay, no questions for now, so we can move on to so-called structured prediction. The idea is to go beyond binary classification and regression: what we have now is an output space which is no longer just {-1, 1}; we want bigger and bigger sets. In the examples these can be sequences; in the context of language translation it can be sequences, and so on. And you can have a non-trivial geometry: distances between the various elements which are not just zero or one. This is encoded by a loss function which essentially tells you that if y is the true output and z is the predicted value, then this is how much you pay. This loss is assumed to be given.

This comes with many examples, by now in all areas of application of machine learning. Here is a protein folding example: you start with a sequence of amino acids and you want to predict the 3D structure of the protein, so the output is not the location of a single amino acid but all the locations at the same time, and clearly the loss you choose depends on the biology you want to capture. Another one, which I like a lot because it comes from a former student of mine: say you want to align a musical score with a performance. Someone plays the score on the piano, you compute some audio features like a spectrogram, and you want to align the two. Here the output set Y is the set of all possible alignments between score time and audio time. Typically this is done by dynamic time warping, and you need to learn the metric for the dynamic time warping. So this is one example where Y is quite a large output space and the loss functions can take values other than zero and one. I give these as examples, but mostly we will consider a general framework.

This is not new; it has kept the machine learning community busy for the last twenty years or so, and there are a few existing general frameworks. One of them is conditional random fields, which is an extension of logistic regression; I will not talk much about it today. The one I want to focus on is the max-margin framework, which dates back to max-margin Markov networks and structured SVMs, by Taskar and colleagues and others. So there are many frameworks dedicated to this question: if I have a problem of that form, what can I do? How can I avoid turning the problem into a sequence of independent binary classification decisions? That is the goal of structured prediction. There are also many works on special cases: ranking, multilabel problems, and of course the multiclass case. I will not talk a lot about those special cases, although they have been studied a great deal; I am going to take a more generic point of view: we want something that works all the time. If I give you a new problem of that form, what do I need in order to be able to learn from it? That is the goal, and there are many interesting research questions. You want the surrogate losses that you design to extend the ones you use for binary classification, which are tractable; you want them to be computable; and of course you want to preserve the nice aspects of binary classification, namely consistency: if I have more and more data, will I predict as well as the best possible predictor in my class? And you want efficiency, in the sense that I want to be able to deal with problems with exponentially many classes.
What do I mean by that? If I take the set of binary sequences of length m, so Y equals {-1, 1}^m, then the cardinality of Y is 2^m. Naturally, for many discrete objects, the cardinality of Y is exponential in the natural dimension. Since I am going to deal with potentially exponentially many outputs, I have to be careful and make sure that when I provide guarantees, those guarantees are compatible with exponentially many classes. In particular, anything that scales like the cardinality of Y over n will not be acceptable, because it is exponential.

So the goal today is to extend what we have seen for binary classification. As a quick reminder, for binary classification we took g with values in R, and the final predictor f was the sign of g. I need to do the same for something a bit more complex. Again we have a loss function, where y is the true value and z is the predicted value, and we assume Y is finite, so the loss is a huge matrix containing the losses for all pairs of potential outputs. It is known that the optimal prediction function, the Bayes-optimal classifier, essentially decouples over x: you can treat every input separately, look only at the conditional distribution of y given x, and select the z for which the conditional expected loss is as small as possible. That is it, just by applying Fubini's theorem and writing the risk as the expectation over x of the conditional expectation of the loss given x, where the expectation is with respect to the data-generating distribution of (x, y). This is the optimal thing you want to achieve.

The difficulty is that this is a minimum over z in Y, and Y is your huge discrete set: if Y is {-1, 1}^m, it is going to be very large. The goal will therefore be to introduce some vector space on which you can optimize, because if you want to parameterize your prediction function, you had better have a vector space, whereas the output space is a difficult discrete set. This is why we need the notion of a surrogate loss function. The surrogate S is a function that takes as input your label y, as before, but now also a vector v in R^k; in the binary case this was just a real value, k equal to one. And we are going to minimize the expectation of S(Y, g(X)); this is what your learning algorithm optimizes, and since it depends on the vector g(x), typically convexly, it will be easy to optimize. The optimal prediction function, the one you would get with an infinite amount of data, does the same thing as before: for every x, it finds the best possible v in R^k which minimizes the conditional surrogate risk. That is the definition of the optimal prediction function g-star. Of course, what you want in the end is to go from that g-star, which lives in R^k, back to an element of Y, and for that you need a so-called decoding function: this is how you go from R^k to Y. So you take your g, apply the decoder, and you get an element of Y. That is the decoding step.
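Going back to the Bayes-optimal prediction for a moment, here is a small numerical illustration of that decoupling (my own sketch, not from the talk): with a finite output set, a loss matrix L, and the conditional distribution of y given x in hand, the Bayes-optimal prediction at x is just the column of L with the smallest conditional expected loss.

```python
import numpy as np

# Toy structured output set with 4 possible outputs and a given loss matrix L[y, z].
L = np.array([[0, 1, 2, 2],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [2, 2, 1, 0]], dtype=float)

p_y_given_x = np.array([0.5, 0.3, 0.1, 0.1])   # conditional distribution of Y at some fixed x

expected_loss = p_y_given_x @ L                # entry z is E[ L(Y, z) | X = x ]
z_star = int(np.argmin(expected_loss))         # Bayes-optimal prediction at this x
print(expected_loss, z_star)
```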
A question in this area of structured prediction is whether the loss function comes first or the decoding function comes first. For instance, say you want to learn a spanning tree. You can say: my decoder will be the maximum-weight spanning tree, and the parameters are the weights I put on the edges of my graph. Most of the time, structured prediction can be approached that way: you start from the decoder and then you look for a good loss function adapted to that decoder. In this talk I start from the loss function, because that is what I prefer, but you can do the opposite, fix the decoding and then find a good loss function.

So what is the goal? The goal is to find a tractable S. By tractability I mean that S should be convex with respect to its second argument v, which will allow us to do optimization; of course you also want some efficiency, otherwise it is kind of useless; and you want the decoding function to be efficient to compute as well. That is the tractability side. But you also want good statistical behavior: consistency, and hopefully explicit constants. You want to make sure that if n is big, then you predict correctly, with bounds which are not linear in the size of Y, otherwise they are going to be too large. Any questions on the goals? No question from the audience. I had a quick question: are you going to state a result that says that, given any loss function, you can always find a surrogate loss plus a decoding function? Yes, there are results of that kind, and I am going to get to them.

The first solution, which is quite popular, is the structured output support vector machine, and it starts from the decoder. You have one score v_y for every possible output y, and you cast the decoder as the argmax over y of v_y. So you will be happy if, for your observed data point, v_y is bigger than v_z for all z different from y, because then the decoder returns the right output. Following the max-margin framework, you want it to be bigger by at least one. This may not be possible, so you add a slack variable, and if you replace the one by the loss between y and z, you get what you really want: if y is close to z, meaning the loss is small, then v_y only needs to be slightly bigger than v_z, but if the loss is big, you want v_y to be much bigger than v_z. If you write this out, the slack satisfies xi greater than or equal to v_z plus the loss of (y, z) minus v_y, and if you take the supremum over z you get exactly the structured hinge surrogate. I do not want to derive it in full detail here; it is a very classical construction. It has some nice benefits, in particular computationally: if you start from a problem where the decoding, the max over z of v_z, is easy to do, so that this combinatorial problem with potentially many outputs can be solved efficiently, then the surrogate has almost the same structure: a max over z of v_z with the loss added on top.
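In the notation above, here is a minimal sketch (mine) of that structured hinge surrogate for a finite output set, S(y, v) = max over z of [ L(y, z) + v_z ] minus v_y, together with the argmax decoder:

```python
import numpy as np

def ssvm_surrogate(L, y, v):
    """Structured hinge loss: max_z [ L(y, z) + v_z ] - v_y (loss-augmented decoding)."""
    return np.max(L[y] + v) - v[y]

def decode(v):
    """Decoder: predict the output with the largest score."""
    return int(np.argmax(v))

L = np.array([[0, 1, 2],
              [1, 0, 1],
              [2, 1, 0]], dtype=float)   # toy loss matrix over 3 outputs
v = np.array([1.0, 0.2, -0.5])           # scores v_z produced by some model at a given x
print(ssvm_surrogate(L, y=0, v=v), decode(v))
```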
Typically, if the loss has a simple structure that is well adapted to the decoder, this surrogate is as easy to compute as the decoder itself; the framework is really built for computational efficiency, and you can design the loss accordingly. You also get a nice convex function: it is a max of functions which are linear in v, so the max is convex in v. So this meets all my tractability requirements, and this is why it became popular ten or fifteen years ago. The issue is that it is not consistent, meaning that even with an infinite amount of data I am not going to recover the optimal predictor. This is true even in the simplest possible setups, like multiclass classification. It is consistent in the binary case, because it is just the SVM, but it is not consistent for the multiclass case with the 0-1 loss, the loss equal to zero if you are correct and one otherwise: it can only be consistent if the so-called majority class exists, that is, if one of the labels dominates with conditional probability greater than one half. For the binary case that is always true, because one of the two probabilities has to be bigger than the other, but for multiclass it only holds for very specific problems, and this is why it is not consistent. For general loss functions we showed that essentially it is not consistent either, except for very particular loss functions. So the structured SVM is a nice framework, but not consistent, and people have observed in practice that it does not always behave well. There has been work on consistent modifications, replacing it by something similar, which you can look up. What I want to talk about is a generic framework which is both tractable and well behaved.

This is the so-called quadratic surrogate, building on earlier work by Ciliberto, Rudi and Rosasco. I am going to simplify the presentation and consider a specific class of loss functions: we assume that you can find a feature vector, an embedding psi of the labels with values in R^k, k being possibly quite large, and a matrix A, such that the loss of (y, z) is the bilinear form psi(y) transpose A psi(z). Such a decomposition does exist: it turns out that if you are willing to let k be very large, even infinite, almost all loss functions you can think of have that form, so this is not a real restriction. I can give you a loss and you know that there exist some psi and A, which may be complicated, but they exist. Why is it so nice to have this form? Because if you look at the optimal predictor and use the bilinear form, you can take the expectation inside the inner product, and you end up with a minimization problem in which the data only enter through the conditional expectation of the embedding of the output given x. So you only need to estimate one conditional expectation: you summarize this complex problem into an expectation computation. And a super nice feature is that you do not even need to know psi explicitly for that.
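In symbols, the key identity just described is the following (a sketch in the simplified notation of the talk, under the assumed decomposition):

\[
\ell(y,z) = \langle \psi(y),\, A\,\psi(z)\rangle
\quad\Longrightarrow\quad
\mathbb{E}\big[\ell(Y,z)\mid X=x\big] = \langle g^{\star}(x),\, A\,\psi(z)\rangle,
\qquad g^{\star}(x) = \mathbb{E}\big[\psi(Y)\mid X=x\big],
\]

so the optimal prediction is \( f^{\star}(x) = \arg\min_{z\in\mathcal{Y}} \langle g^{\star}(x), A\,\psi(z)\rangle \), and learning reduces to estimating the conditional expectation \( g^{\star} \), for instance by (kernel) least squares on the embedded labels.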
This framework comes with nice, generic bounds, related to the comparison inequalities we had for binary classification. That generic bound is all nice, but it is missing one thing, which is the precise dependence on the potentially huge cardinality of Y: if you start to plug in numbers, the bound may not be meaningful. So what we did first, with our student Alex Nowak-Vila, co-advised with Alessandro Rudi, is to try to derive bounds which are more meaningful and which can be applied to a large family of problems for which we know that the dependence on the cardinality of Y is benign. Our assumption is essentially the same as before: up to adding a constant to make things convenient, I assume the loss admits such a decomposition through an embedding of the labels, following earlier work on special cases by us and by others.

This is true for many loss functions. A very classical one is the multilabel problem with Y equal to {-1, 1}^m and the Hamming loss, where you count the number of incorrect labels. The Hamming loss is exactly half of the L1 norm of y minus z, because if y_i equals z_i the component contributes zero, and if they differ, one being plus one and the other minus one, you get a factor of two. Each component of y minus z takes values zero or plus or minus two, so the L1 norm equals the squared L2 norm up to a constant factor. Then you expand the squared norm in the usual way, and because the norm of y is constant, equal to the square root of m, you get that the loss function is a constant minus a dot product. So here A is minus one half times the identity, and the embedding psi(y) is y itself. It looks very trivial, and indeed the embedding is trivial for this one, but in the paper we have a table with many existing discrete losses for which we provide such a decomposition. The key point is that for the Hamming loss, k is m, which should be compared with the cardinality 2^m; this is one reason why you can learn here: the loss matrix has low rank, and the guarantees benefit from that.

What is the learning algorithm then? It is very basic: you observe y, map it to psi(y), and run least squares on psi(y). So it is a k-dimensional least-squares problem, a particularly simple algorithm. And then what we can provide, I will not give the details because it would take too long, are bounds which are polynomial in k, something like the square root of k over the square root of n for the excess risk, with explicit dependence on all the quantities; essentially, the bound depends on the norms of the factors of the loss matrix. We applied this to some simple problems and compared with existing methods, which are sometimes simpler but not consistent, and we exhibit situations where the consistent method outperforms the inconsistent one. Consistency is not always better for finite data, but with an infinite amount of data you do as well as the best you can do; that is what consistency gives you. On the datasets we tried, it does work a bit better. That is the simple analysis.
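Here is a minimal end-to-end sketch of that recipe on synthetic data (my own, with made-up dimensions): for the multilabel Hamming case, psi(y) = y, the surrogate is plain ridge least squares on the label vectors, and the decoder reduces to the coordinate-wise sign.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 500, 10, 6                          # n examples, d features, m binary labels
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(d, m))
Y = np.sign(X @ W_true + 0.5 * rng.normal(size=(n, m)))   # training labels in {-1, +1}^m

# Quadratic surrogate: ridge regression of the embedded labels psi(y) = y on x.
lam = 1.0
W_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)   # d x m coefficient matrix

# Decoding: for the Hamming loss, the arg-min over {-1, +1}^m is the coordinate-wise sign.
X_test = rng.normal(size=(200, d))
Y_test = np.sign(X_test @ W_true)             # noiseless test labels, for illustration
Y_pred = np.sign(X_test @ W_hat)

print("test Hamming loss:", np.mean(Y_pred != Y_test))   # average fraction of wrong labels
```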
Maybe I will not present much more detail, so that this does not run too long. But you may be tempted to say that in the end everything is just least squares on the embedding. If we go back to binary classification, however, that is exactly where we said that least squares is not ideal: least squares on the labels works reasonably well, except for the part where the product y times h(x) is large and positive, where you still penalize good predictions. That part is responsible for behaviors which are not ideal, and there has to be an analogue here. As in the binary case, people tend to prefer the logistic and hinge losses, and one justification can be seen from the optimal function g-star: if I have an infinite amount of data, what will I have to learn? If what I have to learn is simple, it is going to be easy to learn; if what I have to learn in the infinite-sample limit is complicated, it is going to be hard. Let us look at the square loss: it is known that the optimal function is the conditional expectation, E[Y given X = x]; that is what you need to learn for least squares. For the hinge loss, you can show that you have to learn the sign of E[Y given X = x]. This means that if P(Y = 1 given X = x) is a nice, smooth function of x, then taking the sign turns something smooth into something piecewise constant, and when you go from smooth to non-smooth target functions, things may become harder to learn. In statistical terms, depending on the loss function you use, the approximation error may be bigger, because you may need to learn more complex functions. The logistic loss sits in between: its optimal function is essentially the log-odds ratio of P(Y = 1 given X = x), which is often better behaved. This is the classical justification of why logistic can be better: the approximation error is typically easier to deal with.

So the logistic loss has nice benefits with respect to the square loss, and it turns out that in the context of our problem we can also extend the logistic loss to structured prediction surrogates. The idea is very simple, and I will be quick. A surrogate for the square loss is the square loss itself, and you realize that there is a quadratic function in there; following previous work, you simply replace that quadratic function by another smooth convex function. For multiclass, with the log-sum-exp function, this is essentially what happens, and it can be extended to the situations I described: you can design surrogates S which remain tractable, and we showed that all of these are consistent. So we can extend those concepts, as long as certain conditions on the smooth convex function hold; I will not go into the details.
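For the multiclass special case mentioned above, the recipe of replacing the quadratic term by a smooth convex function gives back the familiar softmax cross-entropy, S(y, v) = logsumexp(v) minus v_y, decoded by the argmax. A minimal sketch (mine, not the paper's general construction):

```python
import numpy as np
from scipy.special import logsumexp

def multiclass_logistic(y, v):
    """Softmax cross-entropy written as a surrogate S(y, v) = logsumexp(v) - v_y."""
    return logsumexp(v) - v[y]

def decode(v):
    """Decoder: pick the class with the largest score."""
    return int(np.argmax(v))

v = np.array([2.0, -1.0, 0.5])    # scores for 3 classes at some input x
print(multiclass_logistic(0, v), decode(v))
```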
To conclude, and then I will take questions: my goal in this talk was to present this line of work with Alex and Alessandro, and also to highlight the fact that when you have a structured prediction problem, there is a benefit to going beyond a sequence of binary classification problems, and that consistency is one guiding principle which, in some situations, does give you better results. It is not only about learning theory; I think this can be made practical: you can learn with exponentially many labels if you have some structure, in the form of a low-rank loss matrix. What matters most are those two parts: an embedding psi(y) of the outputs, plus a convex surrogate S(y, v), where v is the prediction. What you plug in as g can be a neural network or a linear model, it does not really matter. So whenever you have a big deep network, you can replace the last layer, the softmax cross-entropy layer, by any of these methods, and you can still run gradient descent, because the community has made sure that these surrogates are differentiable. This is something I would be pleased to see more of, instead of people always using the super-simple surrogates.

A last word on some extensions. I have covered the simpler situations, but more and more people are interested in the so-called interpolation regime, where there is low noise in the data and y is close to a deterministic function of x; this can be handled as well, and existing results showing exponential convergence rates can be extended to this setting. Also, when y is complex, it may be difficult to have the whole y given to you; only a part of y may be observed, which is weak supervision, and there are nice extensions of this framework in that direction. And the last bit: do we really care only about loss minimization? I justified the techniques by saying that I want to minimize the expected loss, but when you do logistic regression you get not only a classifier but also some uncertainty estimates, probabilities. An interesting question is whether this can be extended to the structured case as well: both a good loss and estimates of the uncertainty. On this, I thank you for your attention and I will take questions.

Let us thank Francis for his great talk. There is one question from the audience. She says: very interesting talk, just a clarifying question; is the decoder known or given, or is it learned as well? It depends. If I go back: either you start from a decoder and then you try to find a surrogate adapted to it, or you fix the loss, as in the quadratic framework here, and then the decoder is deduced from it. In that framework you learn to estimate this conditional expectation, and this argmin is your decoder. So sometimes it is learned in a sense, and sometimes it is not. In the paper on the quadratic surrogate we exhibit cases where the decoder happens to be very simple, and then it is not learned; it is given to you by the choice of the loss function. If you change the loss, you change psi, and you change the decoder, so there is an interplay between the decoder and the loss function. And this is why I said that sometimes the decoder comes first and the loss is chosen to be compatible; that was the case for the structured SVM: the decoder was given first, and you are asked to find a loss adapted to it so that the surrogate S is easy to compute. So really, one typically comes first and the other is deduced from it.
I had a follow-up question: in the non-quadratic case that you talked about, what is the decoder? The decoder is essentially the same: you have a similar structure with psi(y), you simply plug in a different argument. If you think of binary classification, whether you use the quadratic loss or the logistic loss, you take the sign of what you learn, so the decoder is essentially the same; what changes is what you learn.

I have another question. You said one of the assumptions concerns the size of the embedding, right? How strong an assumption is that? Do we know that embedding in general, or do we have to construct it? For the nice framework of the quadratic surrogate, the one I mentioned, you do not need to know it: an embedding always exists, possibly infinite-dimensional, in a Hilbert space. So the embedding exists; finding one of appropriate size is the difficult part. But if you take a problem where you know the loss matrix has low rank, then you can find such an embedding. If it does not, then you will not be able to show nice results, at least with this technique; maybe with other techniques. Although what people have shown is that essentially the rank needed for the embedding acts almost as a lower bound on the guarantees you can get. All right, thanks.

There is a question from the audience: you can have many different representations of the loss function; are the results consistent for many possible representations? Yes, essentially all of them are consistent, and for all of them you will have guarantees. What will be different is what you need to learn as a function of x: the function g-star of x will have different properties for different choices of psi. In the bounds we always have some regularity of that function of x, and you can imagine that for some choices of psi it is quite hard to learn and for others it is not. There is an optimal choice of psi based on the regularity of that function of x, but all of the bounds are valid, and you can take the best of the decompositions. In practice, for most losses there is a natural embedding; in the paper we have a big table where we show, for the classical losses, what the embedding is. You can have several of them, but typically there is a natural one.

A question from Hassan: he asks if there are specialized results for the case when the decision set is a more general structured set given by a polytope; this can be helpful for the decoding. I think in terms of the theory this is completely covered; the impact is on whether it is tractable. If you have a polytope, the decoding would be an optimization over that polytope, and in terms of the generalization bound there is no particular difference. That is all the questions we have. Let us thank Francis again.