[00:00:45] >> Thank you. Thank you, Professor Yoshio, for having me here, for the kind introduction, and in particular for promoting me a little early in that introduction. Today I am going to talk about our recent works that try to combine classic optimization and machine learning approaches. I grew up in the field of signal processing; if you check my profile, all my degrees are in electrical engineering, so I am in deep love with signal processing and the associated convex and nonconvex optimization. After graduation I became a faculty member in computer science and walked into the machine learning field, and I realized there are many shared methodologies; I have effectively been shuttling back and forth between signal processing and machine learning ever since, and I keep asking how much we can combine and integrate the methodologies of the two sides. Today's talk is an example of the kind of work we feel happy about after a few years of research. [00:01:55] So let me start by motivating our work from a common phenomenon that most people sitting here have suffered from before. We all know that optimization has contributed to the recent prosperity of machine learning: great machine learning models rely on great optimization solvers. But there is a dark side behind the craftsmanship of optimization algorithms, namely the gap between the beautiful theory of optimization and how the solvers really work in practice. Here are some examples I believe you are familiar with. [00:02:36] Even starting from simple least-squares regression solved with gradient descent, you have to set the right step sizes, and if you use the LASSO or other regularizers you have to balance the regularization parameters. That is the simple case. For a more complicated case, take a software package I used before, IBM's CPLEX, one of the most popular packages for solving linear and integer programs. I specifically checked the most recent version: the reference manual devotes 221 pages to tuning strategies for 135 parameters. And that is not even the most amusing part; I am most amused by the caveat on the first page, which says you need to experiment with them. So with all the suggestions offered by this manual, you still have to find the best setting by luck and by all kinds of trial and error.
[00:03:28] And I have not even mentioned our best friend yet. I do not want to start a heated debate, but anyone who has trained a deep neural network knows how the minibatch size, the learning rate and its decay, the regularizers, batch normalization and so on all affect the final performance. Even if I read three papers on the same topic and they provide full code, I have very low confidence in reproducing the results exactly, and other people may have the same difficulty, simply because of details like how the learning rate is tuned. So while we attribute the success of machine learning to optimization, I think there is no doubt that optimization is not yet fully self-configured and fully automated for the complicated needs of machine learning, which makes us wonder: is it possible to automate the algorithm and hyperparameter selection process for the specific problem and data set at hand? That has been a research topic for quite a while in the machine learning field, hyperparameter tuning, and I consider it a very important aspect of AutoML, in addition to model selection and other things. [00:04:42] So this is a specific topic we are interested in, but now I am trying to extend this picture to a bigger one. On the last page we saw that, given existing algorithms, you want to achieve better performance by tuning their hyperparameters so that they work better for your problem and your data. But what if you do not know which optimization solver or algorithm is the right one for your problem in the first place? Here we see a bidirectional arrow between optimization and machine learning. The direction from optimization to machine learning is well known and is a big thing people keep working on, because machine learning problems keep growing more complicated and need better solvers. I am meanwhile more and more interested in the other direction, from left to right: what machine learning can do for optimization. [00:05:45] To answer this question, let me first make clear that we are not hoping for black magic. Everyone who studies optimization knows the most fundamental result, the no-free-lunch theorem. In plain language, the no-free-lunch theorem in optimization tells us there is no universally best optimizer that works well for all problems and all data inputs, so you should not expect some black magic that gives you an optimizer beating everything you already have across the board. But I want to point out one thing that helps us circumvent this downside: in the machine learning world we never solve every optimization problem in the world; we are almost always solving a specific class of optimization problems. Let me give you a few examples.
[00:06:42] For example, in compressive sensing you solve many problems with respect to random sensing matrices. That is not the general recovery problem at all; it is a very specific setting: optimization problems whose measurement matrices are random and therefore have quite good condition numbers. If we take that into account, it might be possible to utilize the nice properties of those well-conditioned random matrices to derive better algorithms than the generic first-order or second-order methods for l1 minimization. [00:07:29] If your intuition agrees with me, here is another, more complicated example: neural networks. In principle a neural network is a universal function approximator, so it sounds like it can deal with all possible functions in the world, but if you look at the popular choices of neurons this is really not the case. By far the most popular is the rectified linear unit, ReLU, and its variants, what we could call the piecewise-linear family of neurons. [00:08:07] Such a network can be viewed as a cascade of linear transformations, in the form of convolutions or matrix multiplications, interleaved with piecewise-linear transformations. And I hope it is not hard to imagine that if you use, for example, the ReLU nonlinearity, [00:08:28] the resulting partition of the hidden and input spaces always has piecewise-linear decision boundaries, so your final classification boundary, for a plain ReLU network without other modules plugged in, will be a union of piecewise-linear pieces. So we should be honest with ourselves: I am not solving the general classification problem with an arbitrary decision boundary; I am actually working with a very specific class of classifiers whose decision boundary is a union of piecewise-linear functions. That is very strong prior knowledge, and it can also help you bypass the limitation of no free lunch. [00:09:26] To better illustrate my idea, I have adapted a figure from a very well written blog article from Berkeley. It gives a good picture: if you use gradient descent, whose rules were developed with guarantees only in the convex setting, to solve a heavily nonconvex optimization problem, as in many machine learning cases, it is easy to get stuck or saturated in the bad local parts of the landscape. But suppose that, instead of using the analytic gradient rule, I try to learn an optimizer that has first traversed a lot of optimization landscapes from a similar problem class.
[00:10:22] My learning-based optimizer, think of it as a very intelligent agent that travels over those landscapes, would be able to accumulate knowledge of the landscape. The landscape becomes the environment and the agent, the solver, observes it and learns that there are repeatable local landscape patterns, similar shapes it has experienced many times before, in which, if I take this step instead of following my gradient direction, I will have a faster convergence, a faster descent speed, or it will help me get out of the local trap. [00:11:05] The idea here is this: although the analytic rules are imperfect, and the complicated hurdles may get you into trouble when the underlying assumptions fail, a learning-based optimizer might have the hope to solve it from another perspective, namely: I am solving a specific subclass of optimization problems, and I am deliberately overfitting the landscape patterns of those problems. Because I am overfitting just a specific class of optimization problems rather than any optimization problem in the world, I have the hope of taking shortcuts that the analytic rules may not be able to give me, because those shortcuts are patterns shared only by this specific class of optimization problems. [00:11:57] So now I am trying to give you more of an idea of how to make optimizers learnable, because if you come from an optimization background, like I did, it is not so easy to think of how optimization may be learned. That brings us to our main theme today: learning to optimize. It is hard to give a fully general definition. The earliest papers on automated hyperparameter tuning were published many, many years ago; the term learning to optimize was coined only about three years ago in an ICLR paper and initially did not attract much attention, but over the last year it has surprisingly bloomed, with a few workshops at NeurIPS and ICML, a stream of preprints, institutions getting organized, and I even saw MIT this year open a new class called learning-augmented algorithms, which I have been following closely. So here I share my definition of learning to optimize: it generally refers to methodologies that use a machine learning, data-driven approach to improve the performance of classic optimization, and the core methodology is to adapt the behavior to a specific class of problems of interest, or a specific input distribution, by overfitting, with overfitting used here as a positive notion. To make optimizers learnable, there are several machine learning tools that can mimic the description of the optimization process. You can think of an iterative algorithm, like gradient descent or whatever you prefer, as a reinforcement learning process. The metaphor is: the current iterate, my variables, is the state of my agent, and I take my current state and predict my next action, which is the update I want to apply to my variable.
[00:14:09] And I will receive a reward from my environment. The environment here is the optimization landscape, and the reward I receive from that environment is the loss function value; if we are talking about minimization, how fast the loss decreases becomes my reward. Similarly, you could consider the iterative process as a Markov decision process; depending on which algorithm you use you get different amounts of change per step, and the process can also quite naturally be represented as a recurrent neural network. [00:14:45] And if you take a more control-flavored perspective, it can also be viewed as a feedback-loop system; many fixed-point optimization methods were actually derived from solving feedback-loop systems, and we will see shortly how we view learned optimizers from that perspective. There are many more ways to view this problem, and the different views have created different streams of methods for making optimization learnable: formulating optimizers as reinforcement learning agents, as Markov chain sampling, as recurrent neural networks, as control feedback loops, and more. [00:15:30] So this is a very blooming field, and today I will only scratch the surface of it by talking about a few specific pieces of our own work. I will spend most of the time talking about our work on solving the LASSO, or sparse coding, with a learning-to-optimize approach. [00:15:54] This is where I start because, as I learned in my classes, you should always go from the simpler convex optimization problems to the more difficult ones. We worked on this problem on both fronts: we demonstrate some computational benefit practically, and we also prove some nice theoretical results, which we like even better. So I will try to show both theoretically and empirically how much we can gain for solving the LASSO using learning to optimize. [00:16:24] In the second part of the talk I will introduce our recent work on the generalizability analysis of learned optimizers. As I have emphasized, learning to optimize works because it overfits, and we use the word overfit here in a positive way because it brings practical benefits; but overfitting also means that, in certain cases, when the test instances drift away from the assumed subspace of problems, which is not always well defined, you suffer the risk that your learned optimizer will fail. In the second part I will talk about our work on how to detect such failures and how to correct them. And only if I have time, which I heavily doubt, I will scratch the surface of more applications of learned optimizers that we have worked on.
[00:17:27] We have applied learning to optimize to the plug-and-play framework, a famous technique for nonconvex computational imaging; to particle swarm optimization, which we applied to protein docking, a bioinformatics application; and in some recent work we derived what I believe is the first learning-to-optimize-based solver for minimax problems, the first time this has been extended to minimax optimization. [00:17:51] If we do not have time to touch those, please catch me after the talk for more discussion. So let me start with the first part: the LASSO, which I suspect many people sitting here are more familiar with than I am. Assume a linear observation model: we have a measurement matrix A and a signal x to be measured, which we assume is sparse, and the measurement is further corrupted by i.i.d. Gaussian random noise. This problem has many nice names in different fields of electrical engineering, machine learning, and optimization; it is a very fundamental problem with many applications such as MRI imaging and radar imaging, so we love it for many of our applications. The LASSO is perhaps not considered a challenging problem by itself; it has many algorithms derived from subgradient methods, first-order and second-order methods, and so on. For this sparse signal recovery problem a popular approach is the regularized least-squares form of the optimization, where the parameter lambda controls the trade-off between sparsity and data fidelity and is usually tuned by hand or by cross-validation. [00:19:07] Among the many ways to solve this LASSO model, today we focus on one popular choice, ISTA, the iterative shrinkage-thresholding algorithm, also known as forward-backward splitting, whose iteration can be written in the following form. Each time I take my current iterate x, and then, with L being the largest eigenvalue of A-transpose A, which you can treat as a constant once A is chosen, I add to x the residual of the current measurement, y minus A x, multiplied by A-transpose and normalized by L. This sum then goes through a nonlinear operator which we call soft-thresholding: it shrinks the larger values a bit and zeroes out the small values, and the threshold is defined by lambda over L, the regularization parameter divided by that constant. [00:20:10] So it is an iterative algorithm, and as is well known in the literature, it has an overall sublinear convergence rate to the LASSO solution: it first takes a sublinear phase to identify the support, that is, the positions of the nonzero elements, and after the support is correctly identified it enters a faster, linear convergence phase to the final solution. Overall it is a two-stage convergence that is sublinear. [00:20:50] Now I am going to play a few tricks and gradually turn this ISTA algorithm into a learnable system; this is where we started, or rather where the earlier work we build on started. A small numerical sketch of the ISTA update I just described follows below.
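To make the update concrete, here is a minimal sketch of ISTA for the LASSO in the notation above. The dimensions, the synthetic data, the regularization value, and the iteration count are illustrative choices of mine, not numbers from the talk.

```python
import numpy as np

def soft_threshold(v, theta):
    # Soft-thresholding: shrink every entry toward zero by theta,
    # zeroing out entries whose magnitude is below theta.
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def ista(A, y, lam, num_iters=200):
    # ISTA for: min_x 0.5 * ||y - A x||^2 + lam * ||x||_1
    L = np.linalg.norm(A, 2) ** 2          # largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        # gradient step on the data-fidelity term, then shrinkage
        x = soft_threshold(x + A.T @ (y - A @ x) / L, lam / L)
    return x

# Toy problem: random Gaussian A, sparse x, small Gaussian noise.
rng = np.random.default_rng(0)
m, n, k = 64, 256, 8
A = rng.normal(size=(m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
y = A @ x_true + 0.01 * rng.normal(size=m)
x_hat = ista(A, y, lam=0.05)
```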
So first, just to rewrite the algorithm and make our lives easier, we change notation: we replace A-transpose over L by W1, the identity minus A-transpose A over L by W2, and lambda over L by theta, the threshold. So we have a cleaner, better-organized parameterization: W1 multiplies the incoming measurement y, W2 multiplies the current iterate, and theta is the threshold. First things first, I did not do any magic here; all I did was change notation, so do not be surprised. But there is one thing I want to flag, which we will revisit after a few pages: this change of notation actually hides the potential structural relationship between W1 and W2. [00:22:07] Because W1 and W2 both depend on the same problem parameter, the same matrix A, once you rewrite them as W1 and W2 you may easily forget that they are correlated with each other. We will revisit this after a few pages, where exactly this knowledge provides us with a key simplification of the LISTA model. But we will start from here: one W1, one W2, and a threshold, as three new parameters. [00:22:37] With this new notation we can write the iterative algorithm as a feedback system. I still have not changed anything magical here; I just take the same process and use this feedback loop to represent the same update relationship. You can easily do that on paper yourself. [00:23:04] Now the next step. So far I have only drawn nicer notation and used a figure to illustrate it; now I am starting to change things. This recurrent system, if you want to solve for its fixed point exactly, would still take an infinite loop, infinitely many iterations, to reach that fixed point. [00:23:27] The trick, which was not invented by me but by Gregor and LeCun in a paper almost ten years ago, is the following: instead of solving for the fixed point accurately, truncate this feedback loop into a finite number of iterations and unroll it forward, so that I now have a feedforward system, truncated at K stages. [00:23:57] This unfolding-and-truncation trick is equivalent to setting a maximum iteration number for the iterative algorithm you want to run. Once I have a fixed number K of iterations, say K stages of this feedforward model, I no longer view it as a recurrent network; instead I treat it as a K-stage feedforward neural network and train it with data. To get training data I can use the observation model y equals A x plus noise: I generate random sparse vectors x, compute the corresponding y, and each (y, x) pair becomes a training example; I can generate a lot of them and train the neural network whose input is the observation y and whose output target is the sparse solution x, end to end. [00:25:00] That is what is called LISTA, the learned ISTA. A rough sketch of this unrolled network follows below.
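As a rough illustration of the unrolling-and-truncation idea, here is how a K-stage LISTA forward pass might look in PyTorch, with W1, W2 and the thresholds initialized from ISTA and then made trainable. This is my own sketch, not the authors' released code; the per-layer untying matches the original LISTA, while sharing the weights would give the tied variant discussed later.

```python
import torch
import torch.nn as nn

class LISTA(nn.Module):
    # Unrolled ISTA with K stages; W1, W2 and the thresholds are trainable.
    def __init__(self, A, lam, K=16):
        super().__init__()
        A = torch.as_tensor(A, dtype=torch.float32)
        L = torch.linalg.svdvals(A)[0] ** 2          # largest eigenvalue of A^T A
        self.K = K
        # One (W1, W2, theta) triple per stage, all initialized from ISTA.
        self.W1 = nn.ParameterList([nn.Parameter(A.t() / L) for _ in range(K)])
        self.W2 = nn.ParameterList(
            [nn.Parameter(torch.eye(A.shape[1]) - A.t() @ A / L) for _ in range(K)])
        self.theta = nn.ParameterList(
            [nn.Parameter(torch.tensor(float(lam / L))) for _ in range(K)])

    def forward(self, y):
        # y: (batch, m) observations; returns (batch, n) sparse code estimates.
        x = torch.zeros(y.shape[0], self.W2[0].shape[0], device=y.device)
        for k in range(self.K):
            pre = y @ self.W1[k].t() + x @ self.W2[k].t()
            # soft-thresholding written out with sign/relu
            x = torch.sign(pre) * torch.relu(pre.abs() - self.theta[k])
        return x

# Training (sketch): generate (x, y) pairs from y = A x + noise and regress the
# network output onto the true sparse x with an MSE loss, end to end.
```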
To recap: first we viewed the algorithm as a recurrent neural network; then, for a finite number of iterations, we viewed it as a feedforward network, unfolded and truncated to K stages; and then we trained this feedforward network in a data-driven, end-to-end way, like a regression. It is just a neural network regressor in the last step. [00:25:31] LISTA was invented about ten years ago for a much smaller goal than how it is treated now. It was originally invented to do fast prediction of sparse codes for video, because in a video the neighboring frames are typically very similar, so the authors wanted to accelerate a sparse-coding-based video recognition system by generating the sparse codes with a fast regressor. It started as a very practical piece of work, but to me it is one of the pioneering works when looked at from today's perspective of data-driven optimization solvers for inverse problems. What it does is take an iterative algorithm, truncate it to K iterations, and then W1, W2 and theta all become trainable parameters, subject to tuning and updating by end-to-end training of the regression. This became one of the earliest works in learning to optimize for the LASSO, although it was not presented as such. [00:26:42] Naive as it may sound, people then started to realize that this is embarrassingly effective despite its simplicity. Here is a result taken from the original paper. If you assume you are solving LASSO problems y equals A x with A unchanged but x changing over time, then by training such a fast neural network regressor on the generated synthetic data, you very easily get a regressor that converges orders of magnitude faster than running the real LASSO solvers. [00:27:12] Here is an example: we compare against the original ISTA with different lambdas, depicted by these slowly dropping curves, and against the FISTA algorithm; the LISTA regressor achieves the same reconstruction error on the same problem with roughly a hundred times fewer iterations. [00:27:52] So it is quite a good acceleration, and the solution quality is empirically good. But first, I hope you are not too surprised by this seemingly magical acceleration, because ISTA can be used to solve any LASSO inverse problem, independent of A, independent of x and of the noise; that is all the algorithm requires. Meanwhile, this learned solver is very heavily overfitted: it can only solve LASSO problems with regard to the same measurement matrix A, and the signal x and the noise vector have to come from the same distributions. So the acceleration is obtained by heavy, heavy overfitting; it is essentially solving one problem. [00:28:47] And the second thing I want to remind you of is that this really does not scale well.
The way it was done ten years ago was to unroll the recurrent network for K iterations, a typical number being K equal to 16. Then, instead of one feedback loop with one set of weights W1, W2 and theta, the unfolded network has 16 stages, and each stage has its own W1, W2 and theta, all initialized from the same copy; the suggestion was to treat all 16 stages' W1, W2 and theta as independent parameters. So you suddenly get sixteen times more parameters than in the recurrent system. We could also say that the fast regression perhaps arises from this heavy over-parameterization of the network, but using on the order of a million parameters just to learn to solve one LASSO problem, however well you solve it, definitely looks like overkill; it does not sound too exciting. [00:29:58] And it also becomes a bigger black-box model, because W1 and W2, which were converted from the original A, take on a life of their own: after training it is hard to interpret what their relationship to the original A is. If you are a statistician trying to rely on the LASSO for feature selection or interpretable learning, this is not really acceptable; it is just a black box that fits the predictions while the meaning of the original measurement matrix is lost. [00:30:33] So when we first looked at the problem, it did not feel right to us. Why do I have to have millions of parameters to solve just one LASSO problem, given that the LASSO is itself already a special class of regression problem? Why do I have to sacrifice so much by focusing on one measurement matrix? And we were also not happy with the lack of theory behind this empirically very effective method. So we did something that is summarized in two of our papers from the last couple of years. [00:31:15] In those papers, published at NeurIPS and then at ICLR, we propose what I would call a fairly comprehensive theoretical framework for understanding why this acceleration is available, when it can be attained, and how it can be improved. We are able to show improvements over the classical analytic sparse recovery algorithms: first, we can build a LISTA network that is guaranteed to converge to the true sparse solution rather than to the LASSO solution, which is known to be biased by the l1 penalty; and we are also able to reduce the total number of iterations, typically needing only 15 to 20 iterations, compared with solving the LASSO by the classical iterative solvers. [00:32:05] We also improve over the existing learning-based recovery algorithms, including LISTA and its successors, by providing three aspects of contributions, theoretical and practical.
First, we were able to show that the training of LISTA can be much simplified; it does not need to be treated as a general black-box neural network. We show that more than 99 percent of the weights of the unfolded feedforward network need not be learned from data; you only need to learn a few key hyperparameters, namely the step sizes and the thresholds. This reduces the so-called learned fast regressor from a general-purpose black box to essentially a data-driven adaptive selection of step sizes for a first-order algorithm; it demystifies this learned regression. [00:33:01] The second thing, on top of this demystified explanation, is a proof that this data-driven step size provably improves the convergence rate of first-order l1 minimization from sublinear to linear, and it also gives a stability and robustness conclusion. [00:33:26] Third, we show that moving to the world of learning-based optimizers does not mean discarding the traditional wisdom of the optimization field: many classical techniques developed for acceleration are amenable to our learning-to-optimize setting, and we provide a proof that a classical technique, the kicking trick from linearized Bregman iteration, provably improves the convergence rate here as well. On the practical side, in addition to the theoretical gains, and there is official code to try, what really sounds good is this: training this fast LASSO regressor goes from on the order of ten hours down to just six minutes on the same GPU, roughly a hundred-fold acceleration, as a result of better theory. [00:34:20] I am not going to go through all the key results and the logical chain of the derivation, and I am skipping the proof details because of the shortage of time, but if you are interested you are welcome to discuss them with me after the talk. The first main result is a necessary condition for convergence of LISTA to the true sparse solution: [00:34:48] if the LISTA network converges as the number of layers goes to infinity, what relationship do W1, W2 and theta have to obey? Here is the necessary condition: W2 plus W1 times A has to asymptotically approach the identity matrix, and the thresholds have to approach zero. The second part is not surprising, given the fixed-point condition; it means the thresholding should no longer be in effect in the last steps. The first part, I hope, also jogs your memory: recall that in ISTA, W1 was A-transpose over L and W2 was the identity minus A-transpose A over L, so both depend on the matrix A; if you do a simple change of variables, W2 equals the identity minus W1 times A, obtained by replacing A-transpose over L in the W2 expression by W1, then this condition is satisfied exactly, and we prove it is a necessary condition for asymptotic convergence without bias. The update and these conditions are written out below.
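In symbols, using my own notation (the papers' notation may differ slightly), the LISTA update and the conditions just described read roughly as follows:

```latex
% LISTA update at layer k (\eta is soft-thresholding with threshold \theta^k):
x^{k+1} = \eta_{\theta^k}\!\left(W_1^k y + W_2^k x^k\right)

% Necessary conditions for convergence to the true sparse solution:
W_2^k + W_1^k A \;\longrightarrow\; I, \qquad \theta^k \;\longrightarrow\; 0

% Enforcing the coupling W_2^k = I - W_1^k A from the start gives
x^{k+1} = \eta_{\theta^k}\!\left(x^k + W_1^k\,(y - A x^k)\right)
```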
[00:35:54] We also show empirically that when we train LISTA freely, without enforcing this asymptotic condition, the condition is nearly satisfied anyway: after training, the difference between W2 and the identity minus W1 A is already very close to zero. So this is the first thing we reveal about the structure of the weights W1 and W2: for LISTA to converge, they are not independent parameters of two fully connected layers; they are heavily dependent on each other if LISTA is to converge correctly and asymptotically. That is our first theoretical cornerstone. [00:36:34] Then, standing on this cornerstone, we get an empirical inspiration, which we found works in the following nice way. Since this asymptotic condition has to be satisfied eventually, why not enforce this structure from the very beginning, as a kind of regularizer? You can consider it a special weight coupling. [00:37:00] So in each layer we directly eliminate the W2 parameters by expressing them through W1 and A. I am merely chopping away a large chunk of the parameters: the remaining learnable matrix per layer is only W1, so the parameter count is no longer doubled across the two matrices. By removing the W2 parameters, training becomes more stable. That is good, but it is only the first step. [00:37:30] Second step: based on this weight-coupling structure, we are able to prove a convergence guarantee for this coupled LISTA algorithm. We show that, with the necessary condition, i.e. the partial weight coupling, and a few other mild conditions, basically requiring the mutual coherence of A to be small, x to be sparse, and the magnitude of x to be bounded, assumptions commonly made in the compressed sensing literature, the error of the LISTA iterate, with regard to the number of unfolded iterations, goes linearly to the true sparse solution; the convergence rate you see here is linear. [00:38:32] The approximation error is also linear in the noise magnitude, with both terms weighted by two coefficients: c1 depends only on A and its conditioning, and c2 depends only on the distribution of x. [00:38:51] The detailed proof is in our paper; this is the linear convergence result. Then, building on the linear convergence proof, we go further. Many tricks have been developed to accelerate first-order optimization algorithms; we pick one famous trick to study, called kicking in the linearized Bregman algorithm. The idea is that in each iteration I select a portion of the largest components, trust them as my true support elements, and pass the selected entries directly to the next iteration without going through soft-thresholding, so their values are not attenuated. This is a well-verified strategy to accelerate the convergence of the largest-magnitude entries. A small sketch of this support-selection step follows below.
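Here is a minimal sketch of how this support-selection idea could be dropped into the thresholding step. The fraction of entries to trust per layer is a hyperparameter I made up for illustration; the schedule actually used in the papers may differ.

```python
import numpy as np

def soft_threshold_with_support_selection(v, theta, trust_frac=0.1):
    # Pick the largest-magnitude entries, trust them as true support, and pass
    # them through unshrunk; soft-threshold everything else as usual.
    out = np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
    k = max(1, int(trust_frac * v.size))
    trusted = np.argsort(np.abs(v))[-k:]   # indices of the k largest entries
    out[trusted] = v[trusted]              # no attenuation on trusted entries
    return out
```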
[00:39:51] The beautiful thing about our framework is that we can show this classical linearized-Bregman kicking trick also works in our LISTA setting, by proving an improved convergence bound. The bound still has a term linear in the standard deviation of the Gaussian noise, but what changes is the convergence rate, the exponential term, and the weighting coefficient. As we show here, with a large enough k, the convergence with the kicking technique is provably improved: the rate is still of the linear-convergence type, but with a better constant, so faster acceleration; and given enough iterations, the coefficient c_ss is also smaller than the general c2, so you also expect a smaller approximation error for the same problem by applying kicking. [00:41:00] That is basically what we presented in our first paper, the NeurIPS paper. After we got that paper done we were quite happy, but it still did not feel completely right. Let me give you a recap of where we were: we revealed that the weights W1 and W2 in the unfolded network are dependent and used that to eliminate W2, so now we only have W1 per layer; with 16 layers I have 16 W1 matrices, together with a step size and a threshold per layer. Those are all the parameters we have to learn, already reduced from LISTA. But the first thing we discovered, right after submitting the NeurIPS paper, is that in our proof we do not actually need the W1 weights, the only remaining fully-connected layers in the network, to depend on the layer index k. We found that if we remove the index k and share this matrix across layers, the convergence proof in the NeurIPS paper still stands; we were simply too slow to discover it before submission. So a simple extension of the proof supports this weight sharing: using one W1 across all the iterations changes essentially nothing in the proof. [00:42:29] It reduces the parameter complexity from order K matrices down to being independent of K, because now all the layers share just the one W1 matrix. We call it tied LISTA, and now the only matrix-sized complexity is the size of the measurement matrix, W being n by m, where m is the measurement dimension and n is the signal dimension, plus a few step sizes and thresholds; it is just one matrix. But is that the best we can do? Not yet: I can even remove this last portion of matrix parameters from the formulation. [00:43:13] This is the ALISTA result in our ICLR paper, and it really gave me some surprise. The main result of the ICLR paper reveals two facts. First, the optimal weight of LISTA, the one that determines the convergence,
does not actually need to be learned from data. The role of the learned W in our proof is only to transform A into a version with lower mutual coherence. If you are familiar with classic optimization, that is very similar to preconditioning, whitening my measurement matrix; it is not about fitting the data at all. The optimal W with regard to A is just a transformed version of A that has lower mutual coherence. [00:44:12] So that gives us the final version of the algorithm. If that is all W has to do, why do I have to learn it from data? It has nothing to do with data; it just needs to have low mutual coherence with A, and that can be obtained by an analytic optimization, independent of the data, independent of any training. [00:44:39] So, before I have seen any data, I first solve a mutual-coherence minimization problem; it is nothing but a standard quadratic program and can be solved very easily. Then I take this W, plug this analytically computed weight into the LISTA network, equipping every layer with that W1, and use the relationship between W1 and W2 to obtain W2. So, without even seeing my first data point, I have already fixed 99 percent of the weights in this neural network regressor. What remains to be learned is just the threshold and the step size at each layer; that is all we need to do. First compute W by the preconditioning operation, minimizing this mutual coherence between W and A (a small sketch of this step follows below); after W is solved, use this analytic weight to fill in 99 percent of the weights of the neural network regressor, and then do data-driven learning of the step sizes and thresholds, which is proven to retain the linear convergence rate of the original coupled LISTA. So learning the whole network with millions of unfolded, untied parameters is provably matched by learning, for a 16-layer network, just 16 step sizes and 16 thresholds. That is quite a story. [00:46:22] Here I show you an empirical result, tracing where we started from: the original LISTA, whose parameter count is quadratic in the signal dimension and linear in K; the weight-coupled version, which already gives a reduction; and our final network, called analytic LISTA, or ALISTA, which has only 2K parameters, K being the layer number, so for our 16-layer network that means I only have 32 parameters to learn from data. [00:46:55] The first version of LISTA by Gregor and LeCun takes about ten hours to train; with our NeurIPS 2018 version it becomes about 1.5 hours of training; and our latest version takes six minutes, simply because there are just 32 parameters left to train.
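As a rough sketch of this data-free preconditioning step: one common surrogate is to minimize the Frobenius norm of W-transpose A while pinning its diagonal to one. Whether this exactly matches the program solved in the ALISTA paper, and the crude projected-gradient loop below, are my own assumptions for illustration; a generic QP solver would do the same job.

```python
import numpy as np

def analytic_weight(A, steps=500):
    # Data-free preconditioning: minimize ||W^T A||_F^2 subject to
    # diag(W^T A) = 1, via a simple projected-gradient loop.
    col_sq = np.sum(A * A, axis=0)          # ||a_i||^2 for each column
    W = A / col_sq                          # feasible start: w_i = a_i / ||a_i||^2
    G = A @ A.T                             # m x m Gram matrix
    step = 0.9 / np.linalg.norm(G, 2)       # safe step size for the 2*G*W gradient
    for _ in range(steps):
        W = W - step * (G @ W)              # (half of) the gradient of ||W^T A||_F^2
        # project each column back onto its constraint w_i^T a_i = 1
        W = W + A * ((1.0 - np.sum(W * A, axis=0)) / col_sq)
    return W
```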
[00:47:17] Here are the convergence curves for the different versions of our learned optimizers; the ALISTA curve is over here, and you can see that by reducing the parameter count we really did not sacrifice performance; in fact the convergence is more stable and sometimes reaches an even smaller error, thanks in part to the application of the kicking technique and other refinements I did not mention here. One thing I do want to note is that all the theory we proved is extensible to the convolutional case; we have the full proof for the convolutional case in the ICLR paper. We also have robustness results with respect to model perturbation: the reason our model is more robust is that W is solved analytically from A rather than fitted from data, so even if A is perturbed a bit, the W given by the coherence-minimization step only shifts a bit, because it is solved analytically. [00:48:16] Also, one thing I want to share is that we surprisingly discovered, in a companion experiment, that when using random matrices A, changing one random A to another random A almost does not change the learned step sizes and thresholds. So I can train the step sizes and thresholds using one random A, and they seem to generalize very well to a different random matrix A, without even fine-tuning. That remains a mystery to me; my conjecture is that it is related to the geometry of random projections, and we are actively working on this topic. [00:49:05] All right, long story finished. I will use the last ten minutes to tell the second part of my story: robustness. As I said at the very beginning, the learned optimizer wins because it overfits the problem at hand, so it is expected to work well on similar problems and data sets. But it is sometimes hard to define what a similar problem or a similar data set is. If I train a solver on one network's training problem, would you consider solving another network a similar problem or not? It is often hard for me to say. So we have to have a fallback for that situation. [00:49:51] When my learned optimizer does not work as expected, I have to know that, I have to prevent it, I have to have a safety net for it. Here is my test setting, a toy problem. Even on the original LASSO problem, we find that ALISTA, the best version of this sparse recovery network, is still sensitive to a change in the distribution of x. For example, if I change the distribution of the nonzero values, sampling them from a uniform distribution versus from a Gaussian distribution, even with the same sparsity level, I find the learned solvers are hardly transferable. So how do we make learning to optimize more robust toward unseen problem types? [00:50:43] Well, it is just like mobile phones: we all know the fancy phones do not always work. My iPhone has broken at least five times in my life, and I am not the only one; I very much miss the Nokia I used during my college years, which never broke.
[00:51:03] Similarly, can we create an old-fashioned, slow, but solid and trustworthy backup for the learned optimizer? And if I ask you what a solid backup for a learned optimizer would be, the answer is the analytic optimizers: they are slow, they are sometimes not great, but they are problem-agnostic. So if my learned optimizer fails because the problem has changed, falling back to an analytic optimizer is likely to be a much safer choice. But how do I know when the learned optimizer is about to fail? [00:51:42] Here is our recent submission, called safeguarded learning to optimize. I am only covering high-level details here because it is not published yet. I consider it perhaps a first effort to make learned optimizers generalizable to unseen problems, and the vision is to create a hybrid algorithm. We work in a local fashion, at each iteration: we have a criterion, called the safeguard, derived from fixed-point optimization theory, which evaluates at a high level whether the current objective decreases sufficiently fast, with an analytic rule to decide that. If the learned optimizer's step is judged unsafe, I switch back, for this iteration only, from the learned update rule to an analytic rule, say gradient descent. [00:52:40] And I do that checking at every iteration, again and again; once I see that my learned optimizer is back on the right track, decreasing sufficiently fast, I switch back to using it, and such switches can happen repeatedly along the trajectory. [00:52:57] I want to point out two things. First, the safeguard is a non-monotone criterion, so it is amenable to optimization trajectories with a lot of fluctuation; we allow for that and it is fine. Second, we have so far only worked on the convex case; it is not yet directly extensible to the nonconvex case. [00:53:17] At a high level, this is based on the Krasnosel'skii-Mann iteration and fixed-point theory, and we derive all our theory from it. One interesting fact we show in the paper is that most learned optimizers can be summarized under this unified fixed-point framework, which comes from the field of averaged operators. [00:53:49] The key step, viewed from a high level, is a judging mechanism, a safety check: if a certain residual quantity of the candidate solution is smaller than another term, a weighted quantity u_k derived from an analytic rule, this is the guiding part, then the candidate is judged safe, and I take the solution produced by the learned optimizer, shown in the fifth line, as my new iterate x at step k plus one. If it is not safe, I apply T to the current iterate, where T denotes a standard analytic update, by default a gradient or subgradient descent step, and fall back to that. At every new iteration I do this conditional check and use the learning-to-optimize update only if it is safe. Let me show you the results, to give you an idea of how it works; a loose pseudocode sketch of the loop follows below.
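Since the paper is not out yet, this is only my loose pseudocode of the hybrid loop as I understood it from the talk; the exact acceptance test, constants, and operators in the actual safeguarded method may differ.

```python
import numpy as np

def safeguarded_l2o(x0, learned_step, analytic_step, num_iters=100, c=1.0):
    # Hybrid loop: try the learned update at every iteration, keep it only if a
    # fixed-point-style residual test says it is "safe"; otherwise fall back to
    # the analytic update (e.g. a gradient / proximal-gradient step).
    # `learned_step` and `analytic_step` map an iterate to the next candidate.
    x = x0
    for _ in range(num_iters):
        z = learned_step(x)                   # candidate from the learned optimizer
        Tx = analytic_step(x)                 # analytic (safeguard) update
        # Illustrative safety test: accept the learned candidate only if its
        # fixed-point residual is no worse than c times the current one.
        if np.linalg.norm(z - analytic_step(z)) <= c * np.linalg.norm(x - Tx):
            x = z
        else:
            x = Tx
    return x
```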
[00:55:03] Here is a result, with our own ALISTA as the learned algorithm. For the test problems, we created two groups that share the same measurement matrix A but differ in the distribution of the nonzero elements of x: in one group they are drawn from a uniform distribution, in the other from a Gaussian distribution, so they create different landscapes. If I train on uniform-sparse problems and test on uniform-sparse problems, ALISTA works very well: this curve is the convergence of the first-order analytic optimization, and this one is the learned optimizer, which generalizes very well; and notice the vertical axis is on a log scale, so we need about three orders of magnitude fewer iterations to achieve the same accuracy when the testing problems share a similar landscape with the training problems. The more interesting, previously untested case is the other group. [00:56:10] I train on the uniform-sparse elements and test on the Gaussian-sparse elements, so there is a very simple mismatch of landscapes. Note that this plot goes back to a linear scale, because we want to emphasize the early-stage behavior of the optimizers; sorry if that causes any confusion. The dashed line is still the analytic algorithm, which does not care: whatever you put in, it works. [00:56:40] But the learned optimizer, denoted L2O here, you can see that around the sixth iteration it starts to diverge, and actually if you keep running it, it never comes back; the error gets amplified further and further, showing the very limited generalizability of learning to optimize here. But if you switch to our safeguarded ALISTA, the learned-safeguarded version of the same iteration, you see it starts to show much more desirable behavior. In the first several iterations the original ALISTA and the safeguarded ALISTA share the same trajectory, as expected, because the safeguard is not triggered and the two algorithms do exactly the same thing. [00:57:35] The difference starts to be seen at the sixth iteration,
which we show with a dashed spike: the spike indicates that safeguarding kicked in, falling back to gradient descent at this step, and the height of the spike indicates what fraction of my test samples had the safeguard activated. You can see that at this sixth step essentially all of my samples, one hundred percent, agree: do not use the learned update here, it is not good for their optimization; they all fall back to gradient descent. And that is exactly the step at which the original, unsafeguarded ALISTA diverged; our new safeguarded ALISTA avoids that big jump in error and continues the good trend of decreasing error. Then we come to iterations seven and eight, which seem to be fine: there is no dashed spike, no fallback is activated, so I keep using the learned optimizer safely and enjoy the faster acceleration. Then, around the eleventh iteration, the spikes appear again: almost all the samples decide the mismatched learned optimizer is not trustworthy and go with the analytic gradient descent step again, and so on, switching back and forth, and in the end the safeguarded version converges much faster. [00:59:09] This is a very encouraging result. At first I did not expect that a generalizability mismatch could be saved by making very small, very local corrections along the long trajectory; only a few steps are corrected, just like in life, where taking a few corrective steps at the right time keeps you on a good track. [00:59:31] So with just those few corrections you enjoy all the benefits of the fast learned optimizer together with robustness, accuracy, and efficiency. We also tested other examples: we took another learned ADMM optimizer published at ICML this year and implemented the safeguard on that algorithm as well, with a similar story. It works very well, without surprise, on the same problem type; but when the problem changes, you see it diverges very fast, and if we apply our safeguarding we are again able to correct it toward the right direction and track a lower error. So we have seen that our learned optimizers can be made quite a bit more generalizable. The theory is based on convex optimization, and for the larger group of algorithms that can be categorized as Krasnosel'skii-Mann fixed-point iterations, the same recipe should work. We are now testing more problems, including quadratic programs and large-scale distributed non-negative least squares, and the safeguarding mechanism indeed keeps working for these first-order convex problems, so in general we are quite happy with it. The take-home message is that to augment our optimization with machine learning, you cannot just enjoy the benefit; you also need to pay attention to the safety issue.
[01:00:55] You have to have a mechanism to decide when the learned optimizer is trustworthy, and use it then, and when it is not trustworthy, to have a fallback option that you know is very safe, and then switch back to the faster track when there is a safe opportunity. We have ongoing work extending this safeguarding to convex-concave minimax optimization, and the holy grail would be to make it work for nonconvex optimization and eventually even for network training, which is my interest, but we are not there yet; we are still making progress on that. [01:01:24] Okay, I guess I can just go through the remaining pages quickly. Partly because of my own background in computational imaging and compressive sensing, we are now also extending learned optimizers to many more problems. At ICML this year we gave a relaxed and much more practical convergence proof for plug-and-play optimization, a popular option for inverse problems in computational imaging, together with guidance on how to properly ensure faster convergence; I think this work has been quite well received in the computational imaging community. At NeurIPS this year we introduced, to my best knowledge for the first time, a learning-to-optimize strategy for swarm-based optimization: an LSTM-based solver for particle swarm optimization, which we applied to protein docking, a classical bioinformatics problem. In several works we have submitted recently, we have applied this to some quite challenging problems: for example, with some colleagues we are using learned optimizers to solve graph-based combinatorial optimization with well over four million variables; we have also applied learning to optimize to integer programming, with a practical application to training quantized neural networks, which is very practical for us; and we recently also have the first paper on learning to optimize for computer networking, which we have posted, so if you are interested you are welcome to request the manuscript from me. We also have work in progress developing the first learned optimizer for symmetric positive definite matrix problems that can provably find the correct eigenvectors, and what I want to emphasize here is the correct eigenvectors, not just the correct eigenspace. [01:03:27] Together with a few collaborators we are also now writing a book chapter, with all of the formal mathematical proofs, hopefully not overwhelming, on learning-based optimizers, and our students are also preparing related materials which, if I do not delay them, will hopefully come out soon.
Please keep an eye out for those. So here is my summary. I think there is huge room for improvement through learning to optimize, through this positive, healthy kind of overfitting. No, we cannot beat the no-free-lunch theorem, but with a more careful design we can get the benefits of data-driven learning together with the benefits of analytic methods, we can generalize to broader problems, and we can get the robustness and safety we want. At a high level, the picture is that, just as optimization has helped machine learning so much, there is great potential for machine learning to help optimization as well, and we hope more of you will join us in exploring this. The last page is always dedicated to the students who actually did the work, while I am probably doing nothing but shamelessly taking the credit. That is all for my talk; thank you very much. [01:04:43] Yes, please. [Audience question.] It is not necessarily the same philosophy as in the convex problems. For the convex problems, the learned components take the form of iterative solvers, because there the theory is easier to derive: the convergence proofs are relatively easy because most of them do not require specific operators; they just require the operators to satisfy certain conditions, such as being averaged or nonexpansive. [01:05:27] So as long as those properties are preserved in your learned modification, you are still covered by the convergence guarantee; that is the magic behind it. But for the nonconvex problems and applications, where we do not have any good theory anyway, our learning is fully data-driven, like the LSTM solver I mentioned, without any theory relating it to the original algorithm it replaces. So we follow different philosophies depending on what we want. [01:06:07] [Audience question, partly inaudible.] This is a very good question, and we have not tried it yet. In our current LASSO model we precompute the analytic weight from A before the training of the neural network, and then do the training. If you are working with really well-behaved matrices, like the random A I mentioned, perhaps you would not even need to retrain the network for whatever matrix you are using, because of the geometry shared by random matrices; but if you are updating A on the fly and your A is structured, you may have to redo it. We have not tried it yet because it is a bit different from our applications, but I think you can try it. [01:07:03] Because right now this is purely empirical, I am not sure about the theory side; but empirically, with ALISTA you could give it a try: update W periodically, apply the updated W into the model, and then fine-tune the remaining parameters of the model. [01:07:26] Thank you.