Hello everybody. This talk is about learning Markov random fields, also called graphical models, and it is joint work with Adam Klivans from UT Austin. The talk is mainly about an unsupervised learning problem where, at the highest level, what you're trying to do is the following: you see samples from some distribution, let's say a distribution on the hypercube, and based on the samples you would like to learn some properties of the distribution. Today we are going to be interested in one specific and important property, which is to figure out a dependency graph for the distribution.

So what's a dependency graph of a distribution? At a high level, a graph G is a dependency graph of the distribution if, looking at the random variables X_1 through X_n, whenever there is no edge between two vertices i and j in the graph, the corresponding random variables X_i and X_j are independent conditioned on the neighbors of, say, the vertex i. So in some sense, since there is no edge between i and j, if you condition on the neighbors of i, the variable X_i becomes independent of X_j. At this level of generality, just asking for a dependency graph is not a well-posed question; you can come up with degenerate answers and so on. But there is a meaningful way to formalize it, and for that I first need to mention the Hammersley-Clifford theorem, which, under some non-degeneracy conditions, says the following: G is the dependency graph of a distribution if and only if the probability density function is proportional to the exponential of a sum of local functions of the variables, where the local functions are supported only on the cliques of the graph. That is, you look at the cliques of the graph G, and for each clique you are allowed a function which depends only on the variables in that clique.

One very important example is the Ising model, where the probability density function is exactly proportional to the exponential of a sum over the edges of the graph of terms with weights w_ij times x_i x_j, plus some mean-field terms, that is, linear terms. More generally, you have a t-MRF, a t-wise Markov random field, if in this Hammersley-Clifford decomposition you only allow cliques of size at most t. So t equal to two corresponds to Ising models, and larger t to more general Markov random fields. These are also known as graphical models in machine learning, and some people call Ising models Boltzmann machines and things like that.

So here's the question. I give you samples from a t-MRF; you should think of t as small, since t equal to two, the Ising model, is already very interesting, and then three, four, and so on. Our goal is to figure out the edges of the graph: given the samples, pin down where the edges are. This is known as the structure learning problem in machine learning. There is also the corresponding question of parameter learning, where you also try to figure out the weights; many of our results extend to recovering the weights and the clique functions, but I'll only talk about structure learning today, and the parameter guarantees are necessarily a bit weaker.
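For reference, here is the density being described, the Ising model and its t-MRF generalization, in standard notation (the usual convention; the talk's slides may differ in signs or normalization):

```latex
% Ising model on \{-1,1\}^n with edge weights w_{ij} and mean-field (linear) terms w_i:
\Pr[X = x] \;\propto\; \exp\Big(\sum_{i<j} w_{ij}\, x_i x_j \;+\; \sum_i w_i\, x_i\Big)

% t-MRF (Hammersley--Clifford form): only cliques of size at most t contribute
\Pr[X = x] \;\propto\; \exp\Big(\sum_{C \text{ clique of } G,\ |C| \le t} \psi_C(x_C)\Big)
```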
This problem has many applications in machine learning; it is used a lot in natural language processing, in figuring out protein-protein interactions, and things like that. There is a lot of prior work on this problem, which I'll get to when I state the results.

The running example for this talk is going to be the simplest case, the Ising model. In full generality the problem cannot be solved, so you have to place some constraints on the graph. We'll think of the graph as having bounded degree, where the degree may be a constant, ten, log n, something like that, and of the weights as numbers bounded between some constants: they cannot be too small or too big. The comparison point for us is the brute-force algorithm, which runs in time about n to the d: you just enumerate all possible neighborhoods of each vertex and check which one is consistent. Can you do better? In fact you can, and there are works which did this before us, but let me tell you our algorithm, because it also generalizes later on.

[In response to a question:] Right, one way to run the brute force is, for each vertex and each possible choice of d neighbors, to estimate the conditional distribution and check whether it explains the data; you can forget about the graph as a whole and recover it one vertex at a time. All right, let me tell you our way to do it.

To describe the algorithm, which is quite simple as you'll see, all I need is the sigmoid function, σ(z) = 1/(1 + e^{-z}) on the real numbers, which has this nice S shape. It comes up naturally, as we'll see later on. So here's the algorithm. I'll tell you how to figure out the neighborhood of a single vertex; when you actually implement it you do this for all the vertices, but let me focus on one vertex. What are we trying to do? We see a sample from our distribution, call it X_1 through X_n. The way the algorithm works, it helps to do the following thought experiment: imagine the i-th variable X_i is hidden, and I try to guess what X_i is based on the other variables; you can think of each other variable X_j as a guess for X_i. If a guess is correct, meaning it has some correlation with X_i (correlation is a loaded word here, but we will make it precise), then it gets rewarded, and otherwise it gets penalized. We are playing a prediction game in some sense, and this game will also be useful in the analysis. How do we reward and penalize the guesses? We use multiplicative weights, which is a classical algorithm from learning theory. To describe the actual algorithm I need one more definition, which I'll state for the Ising model but which also extends to MRFs: the parameter λ, which you should think of as the ℓ1 norm of the coefficients at a single vertex. Each vertex has a bunch of coefficients, and λ is the maximum over vertices of the ℓ1 norm of the coefficients at that vertex. For instance, for the running example I mentioned, where the graph has degree d and the coefficients are all constants, λ is O(d).
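Concretely, the width parameter λ just defined would be, for the Ising model written above (whether the linear term is counted is a convention; I include it here):

```latex
\lambda \;=\; \max_i \Big(\sum_{j \ne i} |w_{ij}| \;+\; |w_i|\Big)
% the largest \ell_1 norm of the coefficients touching any single vertex;
% for a degree-d graph with \Theta(1) weights, \lambda = O(d).
```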
All right, so we are going to play this game. We start with one more small thought experiment: we break up each vertex j into two copies, j-plus and j-minus; this is just for simplicity at this point, and we think of j-plus as betting on positive correlation with i and j-minus on negative correlation. So we maintain weights w-plus and w-minus. Initially we don't know anything, so we just initialize them uniformly, and our actual guess for the weight of the edge will be the difference of the two. Now we see a sample X_1 through X_n, and we use the other variables to venture a guess for what X_i should be: imagine X_i is hidden, every other vertex X_j gets its own weight, and the overall prediction for X_i is the weighted sum. Once we make this prediction we check how good it is, and here is where some work, and the structure of the MRF, shows up. This formula will become clear later on, but for now let's take it for granted: each vertex j incurs a loss, a penalty for how good its guess was, given by this number; the sigmoid appears there, and the loss depends on the prediction and on the actual value of X_i. It seems a little mysterious, but if you dig into it, it is just a conditional expectation in disguise.

[Answering a question:] Yes, basically if you condition on all the variables except X_i, then the probability that X_i equals one is given by the sigmoid of a linear form, and we'll see why that sigmoid appears.

OK, and once you have these losses you update the weights using the multiplicative-weights framework: the weight of j-plus gets multiplied by some factor depending on its loss ℓ_j, and the same for j-minus. Finally, as I said, our guess for the actual edge weight is the difference between the two copies, except you rescale so that the ℓ1 norm is at most λ. Don't worry about this normalization; it is the same for every vertex, and the main thing is that you look at the difference between the positive and the negative copy. Recall λ was the ℓ1 norm of the coefficient vector at a vertex, so the idea is that the vector we are trying to recover has ℓ1 norm at most λ, and we maintain that throughout the course of the algorithm. And that's it; this is the whole algorithm.

[Answering a question:] Sorry, that's a typo on the slide, it should be an i there; this is just the normalization. Thanks.

If you know stochastic gradient descent, this is somewhat like stochastic gradient descent with a different loss function, but not exactly; plain stochastic gradient descent doesn't quite work here, and when you fix it you arrive at this. What we show is that this algorithm actually converges to the true weights at a nearly optimal rate: after 2^{O(λ)} · log n / ε^4 samples, the weights that you have are ε-close to the true weights, for every vertex.

[Answering a question:] Here we are using the assumption that the true coefficient vector has ℓ1 norm at most λ. The w's initially sum to one; if λ were actually one you could skip the rescaling, but in the bounded-degree case λ can be larger than one, so we rescale.
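Here is a rough Python sketch of the per-vertex routine just described. The loss formula and the Hedge-style multiplicative update follow the description in the talk; the learning rate beta and the output rule are my own simplifications (the actual algorithm keeps every intermediate weight vector and outputs the one with the smallest empirical squared loss on held-out samples):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def learn_neighborhood(samples, i, lam, beta=0.9):
    """Multiplicative-weights estimate of the edge weights incident to vertex i,
    from samples in {-1,+1}^n.  lam is the assumed bound on the l1 norm of the
    true coefficient vector at i; beta is a tunable Hedge parameter."""
    X = np.asarray(samples, dtype=float)
    n = X.shape[1]
    # Split every candidate neighbor j into a "+ correlation" and a "- correlation" copy.
    w_plus = np.full(n, 1.0 / (2 * n))
    w_minus = np.full(n, 1.0 / (2 * n))

    for x in X:
        y = (x[i] + 1.0) / 2.0                       # the hidden label X_i, mapped to {0,1}
        x_rest = x.copy()
        x_rest[i] = 0.0                              # X_i itself never votes
        total = w_plus.sum() + w_minus.sum()
        w = lam * (w_plus - w_minus) / total         # current signed guess, l1 norm <= lam
        p = sigmoid(np.dot(w, x_rest))               # predicted Pr[X_i = 1 | rest]
        # Per-expert losses in [0,1]; the (1 + ...)/2 shift keeps them in range.
        loss_plus = (1.0 + (p - y) * x_rest) / 2.0
        loss_minus = (1.0 + (p - y) * (-x_rest)) / 2.0
        # Multiplicative (Hedge) update: penalize each copy by its loss.
        w_plus *= beta ** loss_plus
        w_minus *= beta ** loss_minus

    total = w_plus.sum() + w_minus.sum()
    return lam * (w_plus - w_minus) / total          # estimated weights w_{ij} for all j
```

Running this for every vertex i and declaring an edge wherever the estimated weight exceeds, say, ε/2 in magnitude is the structure-learning recipe the talk describes.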
So roughly log n samples, if you think of ε as a constant and λ as a small number, and you converge to the true set of weights. As a corollary, if the minimum coefficient magnitude in your Ising model is at least ε and the maximum ℓ1 norm is at most λ, you can recover the graph with 2^{O(λ)} · log n / ε^4 samples. For instance, going back to the running example, you can recover the graph with 2^{O(d)} · log n samples. The important thing here is that the dependence on the dimension is logarithmic: you are learning a graph on n vertices using only about log n samples, especially when the degree is small. This exponential dependence was already known to be nearly necessary from the work of Santhanam and Wainwright: there is a lower bound of order 2^{Ω(λ)} · log n / ε on the sample complexity, so the exponential dependence on the ℓ1 norm is necessary, and the only places where we are off are the constant in the exponent and the dependence on ε. Information-theoretically you cannot distinguish the models with fewer samples.

[Answering questions:] Here λ is basically O(d), because the coefficients are each at most some constant. Why the exponential? That is the information-theoretic lower bound: even for bounded degree there is such a bound. If the degree is d, our algorithm uses 2^{O(d)} · log n samples and there is a corresponding lower bound of 2^{Ω(d)} · log n. Yes, that is the example you should keep in mind, although I'll mention one caveat later on.

Another interesting thing is that the running time of our algorithm is quadratic: the work per sample is basically quadratic, and you process about log n samples, so it is Õ(n²) in the case when the graph is bounded degree and the weights are constants. This might seem improvable, but there is a marvelous connection due to Bresler, Mossel, and Sly from 2013, who related this problem to learning sparse parities with noise more generally. In particular, an implication of this connection is that even when the graph is a single edge with constant weight, the best algorithm we know for identifying that single edge runs in time n^1.62, due to Greg Valiant in 2011, and it is actually related to an old problem of Leslie Valiant, the light bulb problem. So one very interesting question is: can you get n^1.62 for the general case of graphs instead of just a single edge?

[Answering questions:] For a single edge, identifying it, the best known is n^1.62, and we can do all graphs in n². I mean bounded-degree graphs, yes. No, no, I said the best algorithm known is n^1.62. This is something Leslie Valiant posed back in the seventies, and the first improvement over n² came only around 2011, with Greg Valiant's work. Maybe you have to wait for the next generation of Valiants.

Let me say a few words about prior work. There are roughly two different streams. One is to make assumptions on the distribution of the Ising model that you get: for instance, if the model has correlation decay, then the work of Bresler, Mossel, and Sly shows that you can actually learn the graph with better parameters and polynomial running time, and so on.
Another line of work is to assume that the weights satisfy something called incoherence; I am not going to define it, this is from the machine learning community, and they also get results, but again these are assumptions on the structure of the distribution and the graph. The most relevant work for us is a really nice paper of Bresler from 2015, which showed that, without assumptions on the underlying distribution, if the graph has bounded degree then there is an algorithm which learns the graph with a number of samples that is doubly exponential in the degree times log n. Once again the dependence on the dimension is logarithmic, which was the main takeaway from his work. That was only for Ising models. To be fair about results for Ising models: a few years ago there was a paper which improved this doubly exponential dependence to singly exponential, with running time about n^4, but they also had a zero-mean-field assumption; they couldn't handle the linear terms. And none of these works addresses the case where you have more general Markov random fields: what if you have a 3-MRF, what if you have a 4-MRF? To the best of our knowledge we couldn't find any rigorous result which does not place assumptions on the underlying distribution. And once again, the connection of Bresler, Mossel, and Sly shows that learning t-MRFs is at least as hard as a well-studied problem in learning theory known as learning sparse parities with noise.

So here is our result. If you have a t-MRF, you can define an analogue of the parameter λ: earlier we looked at the ℓ1 norm of the coefficients at a single vertex; now you look at the ℓ1 norm of all the coefficients that involve a single vertex in the corresponding degree-t polynomial. I won't define it precisely, it is not important for the ideas behind the work; it is really an analogue of λ. You also need some condition on the identifiability of the graph of the Markov random field, since different graphs could give rise to the same MRF, and if different graphs give you the same MRF you cannot hope to learn the graph. So we place an ε-identifiability assumption which is analogous to saying that the non-zero coefficients are at least ε in magnitude, as it was for Ising models. In this case we can learn the graph with roughly the same parameters, 2^{O(λ t)} · log n / ε^4 samples, so the dependence on the dimension is again logarithmic if you think of λ as a constant.

[Answering a question about the right dependence on ε:] Actually I am not sure; maybe ε squared. It could be ε squared, but I don't know; it is either ε squared or something in between, and it is probably not ε to the four. I think ε squared is the answer, and there is a fundamental reason why our approach doesn't give it; in fact it is one of the open problems I'll mention, and I can point to exactly where the difficulty is.

So for instance, if your graph has bounded degree d, then λ is some polynomial in d and you get singly exponential in that, times log n, samples. The sample complexity once again is almost optimal by the results of Santhanam and Wainwright, and the running time of our algorithm in this case is dominated by n^t, so it increases with the order of the MRF. And n^t seems like a lot, but
if you use the reduction of Bresler, Mossel, and Sly, beating n^t here would imply breaking a long-standing problem in learning theory, learning sparse parities with noise: there is a conjecture, or at least many people believe, that you cannot do better than n to the order of t for that problem. So this running time also seems optimal, assuming learning sparse parities with noise is hard. In short, the sample complexity is almost optimal and the running time is optimal, but under an assumption.

In independent work, Hamilton, Koehler, and Moitra extended Bresler's information-theoretic framework to t-MRFs, so they also handle bounded-degree graphs, with doubly exponential dependence on the degree and a logarithmic number of samples. But their result actually works over distributions which are not just on the hypercube; you can have more states, more options for each variable. One thing I didn't mention when talking about Ising models is that our result also works for graphs which are not sparse, as long as the weights are small; all of the previous results need the graph to actually be sparse, that the degree be d, and not just that the weights be bounded. So for instance you can learn a star graph with many small weights, as long as their ℓ1 sum is bounded.

OK, so those are the results. I told you the algorithm and I told you the theorems; now let me tell you the connection between them and how the analysis goes. The main lever we use comes from a fact about Markov random fields, and again I'll specialize to the Ising model because it gives the main ideas anyway. Recall the Ising model has this PDF, and one easy fact which follows from the expression is that if you fix a vertex, the probability that X_i equals one, conditioned on everything else, has a very nice form: it is the sigmoid of an affine function evaluated on the remaining variables, and the coefficients of that affine function correspond exactly to the weights on the edges at vertex i. So that's where the sigmoid in the algorithm comes from. Using this fact we abstract the problem: recall we started with an unsupervised learning problem, with just samples, from which we had to learn the graph; we recast it as a supervised learning problem in the following sense. Let Y be X_i, and let X-bar be all the variables except X_i. I see samples of the form (X-bar, Y), all in the hypercube, and I think of Y as a label, a classification of X-bar, with the condition that the conditional distribution of Y has this form: the probability that Y equals one given X-bar is the sigmoid of a linear function of X-bar. From such samples, can I learn the unknown coefficient vector w?

[Answering a question:] To learn w exactly you would need more samples, because you run into degenerate cases where a coefficient can hide; it can be very small, like one over n, and you get some annoying issues. So this is the supervised learning problem we are going to study. What we actually do is the following: given samples (X-bar, Y) of this form, we try to find another linear function u which is close to the unknown one in the sense of squared loss, meaning the sigmoid evaluated at u's linear function is close to the sigmoid evaluated at the true function, up to some δ.
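Written out, the fact and the resulting supervised problem are the following (the factor of 2 in the first line comes from the ±1 convention for the variables):

```latex
% Conditional law of one coordinate of the Ising model:
\Pr\big[X_i = 1 \,\big|\, X_{-i} = x\big]
  \;=\; \sigma\Big(2\big(\textstyle\sum_{j \ne i} w_{ij}\, x_j + w_i\big)\Big),
\qquad \sigma(z) = \tfrac{1}{1 + e^{-z}}

% Supervised reformulation: given samples (X, Y) with \Pr[Y = 1 \mid X] = \sigma(w \cdot X),
% find u with small squared loss
\mathbb{E}_{X}\Big[\big(\sigma(u \cdot X) - \sigma(w \cdot X)\big)^2\Big] \;\le\; \delta
```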
So let's say we solve this. Is it enough? It's not clear, right: even if the squared loss is small, it does not necessarily mean that u and w are close, or that u gives you any information about the true coefficient vector w. Here is where the properties of the distribution show up. In the second, structural, part of the paper we show that if X is drawn from the Ising model, and the squared loss of u with respect to w is small, then u and w actually are close; they are close in ℓ∞ distance, and here is where the width parameter shows up: the distance is at most 2^{O(λ)} times the square root of the error you had in the squared loss. So there are two steps in our analysis: first, solve the problem of finding a u that nearly minimizes the squared loss; second, a structural statement that says that if X is coming from an Ising model, then small squared loss means the coefficients are also close, which in turn implies that you have learned the graph.

How does this two-step approach carry over to general Markov random fields? If you work through the same calculation as for the Ising model, where the conditional distribution involved a linear function, for a t-MRF the conditional distribution becomes the sigmoid of a degree-(t-1) polynomial in the remaining variables. So we try to solve the same squared-loss minimization question: find a polynomial q such that the expectation of (σ(q(X)) − σ(p(X)))² is small. Then there is again a structural part: if X is drawn from the MRF and the squared loss is small, then, well, it is no longer true that p and q are close coefficient by coefficient, but we can recover the support of p from q; that much we can do. This second part is actually the most involved part of the paper, because it has an algorithm inside it: it is not just reading off the coefficients of q, we have to combine coefficients together to figure out the support, and so on.

In the remainder of the talk I'll mainly focus on the first step: how do I minimize the squared loss, given these samples (X-bar, Y)? Let's go back to the Ising model, where the probability of Y given X is the sigmoid of an unknown linear function, and we are trying to find a function u with small squared loss. The first thing to note is that this is a non-convex optimization problem. Even so, there is a nice result of Kalai and Sastry from 2009, who gave an algorithm which they called the Isotron, because it is a generalization of the perceptron algorithm, and it solves this problem in polynomial time: they show you can find a u satisfying this guarantee, but the sample complexity was polynomial in n and 1/δ, roughly n/δ². This is already pretty good, in the sense that you get polynomial sample complexity, but remember we are after logarithmic sample complexity: we cannot afford n samples, we only have about log n samples. The one thing the Isotron does not exploit is that the unknown function is sparse, in the sense that its ℓ1 norm is at most λ.
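A rough statement of that structural step, as I understand it from the talk (constants suppressed; the precise form is in the paper):

```latex
% Structural lemma (Ising case, paraphrased): if X is drawn from an Ising model of width \lambda and
\mathbb{E}_{X}\Big[\big(\sigma(u \cdot X) - \sigma(w \cdot X)\big)^2\Big] \;\le\; \delta ,
% then the coefficient vectors themselves are close in every coordinate:
\|u - w\|_{\infty} \;\le\; 2^{O(\lambda)} \sqrt{\delta}
% So one needs \delta \approx \varepsilon^2 \, 2^{-O(\lambda)} to pin every edge weight down to \pm\varepsilon,
% and since the learning step pays about 1/\delta^2 samples, this is where the \varepsilon^{-4} comes from.
```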
That is exactly what we do. The algorithm I described at the beginning of the talk, which we call the Sparsitron, is a kind of sparse analogue of the Isotron (there are a lot of '-trons' around: the perceptron, the Isotron, and now this one), and it finds a u satisfying the squared-loss guarantee where the number of samples is just the square of the ℓ1 norm of w, times log n, divided by δ². And this is for any distribution on X; there is no distributional assumption. We started out trying to solve this where X was coming from an Ising model, but this learning part works for arbitrary distributions.

Going back to the earlier question about improving the dependence on ε: this 1/δ² here seems tight, and that is why it seems difficult to use this approach to beat the ε^4.

[Answering a question:] w is some unknown vector, you have some distribution on X, and the condition is that you see samples of the form (X, Y), where X comes from an arbitrary distribution and the probability of Y given X is the sigmoid of w·X. That's it; X is unconstrained, there is no constraint on the distribution of X for this part. This kind of model, where the conditional law of Y given X comes from some class, is known as the probabilistic concepts model in learning theory; it was introduced by Kearns and Schapire back in the nineties. So the problem is learning sigmoids in the probabilistic concepts model of Kearns and Schapire.

All right, so how do we do this? I already told you the algorithm, so let me tell you how to analyze it, and for that I need to briefly review the learning-with-experts framework from classical learning theory. What is the setup? You have n experts; think of each expert as trying to predict the outcome of some experiment, say whether stock prices are going up, or whether it is going to rain today, things like that. You play the following game with the n experts: on each round, each day, the experts make predictions, which for us are just values in {-1, 1}. For instance on the first day, t equal to one, the experts predict x_{1,1}, x_{1,2}, and so on up to x_{1,n}. Once the predictions are made, the experts incur losses. The losses are arbitrary, chosen by an adversary, and are just numbers between zero and one, so the first expert gets loss ℓ_{1,1}, the second ℓ_{1,2}, and so on; you keep playing this game, and on each round the same thing happens. What are we trying to do? We sit on top of the experts: the goal is to maintain a distribution on the experts, based only on past information, which does as well as the best expert in hindsight. You are trying to minimize the regret, which is described by this formula: maintaining a distribution p_t on the experts at each time step, you want to minimize the expected loss you incur under that distribution minus the loss of the best expert in hindsight. This is the classical regret-minimization problem.
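For concreteness, a minimal textbook-style Hedge routine for this game looks like the following (a generic sketch, not code from the paper):

```python
import numpy as np

def hedge(loss_stream, n_experts, beta=0.9):
    """Generic Hedge / multiplicative-weights over n_experts.

    loss_stream -- iterable of loss vectors in [0,1]^n_experts, one per round
    beta        -- update parameter in (0,1); tuned properly it gives regret ~ sqrt(T log n)
    Yields the distribution over experts used at each round.
    """
    p = np.full(n_experts, 1.0 / n_experts)    # start from the uniform distribution
    for losses in loss_stream:
        yield p
        p = p * beta ** np.asarray(losses)     # penalize each expert by its observed loss
        p = p / p.sum()                        # renormalize to a distribution
```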
Now, how does this connect to the problem of learning sigmoids that I talked about? We force our problem into this framework. Recall the question: we have samples (X, Y), where the probability of Y given X has the sigmoid form σ(w·X). For this part, ignoring the constant term, let's also assume the unknown vector w is non-negative and has ℓ1 norm exactly one; if not, you can do some padding and rescaling to arrange it. Initially you don't know anything, so you give weight 1/n to every expert, one for each coordinate. Now you see an example (x_1, y_1). Remember the split into plus and minus copies; that is exactly what makes the weights non-negative: you have two coordinates per variable, where one sees x_j and the other sees minus x_j.

So you see an example (x_1, y_1); how do I update my current weight vector w_1? To fit this into the learning-with-experts game, we think of the coordinates of X as the predictions of the experts: each coordinate is an expert, and x_{1,i} is the prediction of expert i. Then we have this formula for the loss incurred by each expert: it is one half of one plus the prediction error times that coordinate, which is the exact expression I showed in the first part of the talk for the Ising model; the one half and the shift are just there to make the number lie in [0,1], so that we can use the classical learning-with-experts algorithms and their analyses without changing anything. And that's it: you have this weight vector, you run your learning-with-experts update, then you see the next sample and the new losses, and so on. You get a bunch of weight vectors w_1, w_2, up to w_T, and at the end you output the one with the least empirical squared loss. That is how you can use any learning-with-experts algorithm to solve this problem; in particular, if you use Hedge you get exactly the algorithm I described before.

So how do you analyze it? The first step, which follows from the properties of the sigmoid, is that the squared loss of the coefficient vector w_t, this expectation, is at most the inner product of w_t with the loss vector ℓ_t minus the inner product of the true vector w with ℓ_t. That connects to the regret naturally: if you sum over t, the sum of the squared losses is at most the total expected loss you incur in the game minus the inner product of w, the unknown vector, with the sum of the losses. But remember w is a non-negative vector with ℓ1 norm one, so this inner product is at least the minimum total loss of any single expert, which is exactly what appears in the regret. So we have just shown that the sum of the squared losses is at most the regret. Now you can plug in your favorite learning-with-experts algorithm; for instance we use the Hedge algorithm of Freund and Schapire from 1997, which is essentially the multiplicative-weights update I mentioned at the beginning, and it gives regret about the square root of T log n after T rounds. If you take the best w_t, its squared loss is at most the average, which is about the square root of (log n)/T, and that becomes δ once T is about (log n)/δ², as I described. So that is the whole analysis of the algorithm, except for that first inequality, which follows from the fact that the sigmoid is a monotone, Lipschitz function; in fact the algorithm works for any monotone Lipschitz transfer function, not just the sigmoid.
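Putting the steps just described together, with constants suppressed (my compression of the argument):

```latex
% Step 1 (sigmoid is monotone and Lipschitz): per-round squared loss is controlled by a loss difference
\mathbb{E}\big[(\sigma(w_t \cdot X) - \sigma(w \cdot X))^2\big]
  \;\lesssim\; \langle w_t, \ell_t\rangle - \langle w, \ell_t\rangle

% Step 2 (sum over rounds; w is a distribution, so \langle w, \sum_t \ell_t\rangle \ge \min_i \sum_t \ell_{t,i}):
\sum_{t=1}^{T} \mathbb{E}\big[(\sigma(w_t \cdot X) - \sigma(w \cdot X))^2\big]
  \;\lesssim\; \mathrm{Regret}(T) \;\lesssim\; \sqrt{T \log n}
  \quad\text{(Hedge, Freund--Schapire)}

% Step 3: the best of the T iterates therefore has squared loss at most the average,
\min_{t \le T}\; \mathbb{E}\big[(\sigma(w_t \cdot X) - \sigma(w \cdot X))^2\big]
  \;\lesssim\; \sqrt{\tfrac{\log n}{T}} \;=\; \delta
  \quad\text{for } T \approx \tfrac{\log n}{\delta^2}
```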
[Answering a question about how that first inequality is proved:] Basically, if you look at the conditional expectation of Y, it simplifies very nicely. Let me say one thing: if you look at the sigmoid at two real values a and b, then (σ(a) − σ(b))² ≤ (a − b)(σ(a) − σ(b)); that is the inequality you need. So somehow the squared loss, even though we use it at the end, sort of disappears when you do this: you get σ(a) minus σ(b) times the difference of the two arguments, and out comes the expression. I don't have any deeper insight; it is about three lines of calculation. So that is the analysis of the algorithm.

[Answering a question:] I believe this part, the sample complexity of log n over δ², is tight; I don't have a proof, but I think it is tight, because we are not exploiting the distribution of X in this part. Somehow you would need to exploit that. For arbitrary distributions, if you asked me to guess, I would say the 1/δ² is necessary.

OK, so in summary: I told you how to learn an Ising model with 2^{O(λ)} · log n / ε^4 samples. This generalizes to learning Markov random fields, giving the first provable guarantees of this kind for t greater than or equal to three. The sample complexity in both cases is almost optimal, except for the dependence on ε and the constants in the exponent, and the running time is optimal assuming the hardness of learning sparse parities with noise.

There are several nice open questions here. The first one: there is a lot of work in machine learning on Gaussian graphical models, where instead of the distribution being on the hypercube you have a Gaussian distribution with some covariance matrix. But all the known results for learning sparse Gaussian graphical models, even though they state good-looking bounds, still depend on the condition number, so there is a big gap between the information-theoretic results and the computational results in terms of the dependence on the condition number. And for the learning problem: I told you how to learn a sigmoid of a linear function. What if Y is a sum of two sigmoids, σ(u·x) plus σ(v·x)? Can you do anything there? If you can, it probably has very strong implications in learning theory: in particular you could learn Ising models with hidden variables, restricted Boltzmann machines, and there are connections to learning deeper neural networks. So this seems to be the first bottleneck in trying to do those. Thank you.

[Q&A:] Yes, I mean, we didn't do it for larger alphabets; for the hypercube, for t equal to two, we did. For larger alphabets I think it works, but the structural lemma I mentioned is very technical; I believe it should work, but we weren't able to do it, especially for larger alphabets, which the other paper does handle. Sorry, I'm getting the cases confused here: you are asking about t equal to two with larger alphabets? That we can already do; you can ask about bigger t as well, which should be possible I think, but we weren't able to do it.

[Q&A:] You mean for the sigmoid problem or for the graphical model? For the graphical model that is known; there is an information-theoretic lower bound saying you need at least 2^{Ω(λ)} samples. You mean the running time or the sample complexity? It seems necessary because... oh, I see what you are asking. The running time is not exponential in n;
it makes sense: the running time is 2^{O(λ)} times a polynomial in n, not exponential in n. And that seems tight, because you need about that many samples, and also because of the connection to learning sparse parities with noise.