[00:00:05] >> OK, so first of all, welcome everyone. We're resuming our seminar series today, and we're fortunate to have Professor Vincent Tan from the National University of Singapore. A short introduction to Professor Tan: he received the B.A. and M.Eng. degrees in electrical and information sciences from Cambridge University and the Ph.D. degree in electrical engineering and computer science from MIT. He is currently a Dean's Chair Associate Professor in the Department of Electrical and Computer Engineering and the Department of Mathematics at the National University of Singapore. His research interests include information theory, machine learning, and statistical signal processing. He was also an Information Theory Society Distinguished Lecturer in 2018-2019. He is currently serving as an associate editor for the IEEE Transactions on Signal Processing, an associate editor in machine learning for the IEEE Transactions on Information Theory, and a member of the IEEE Information Theory Society Board of Governors. So Vincent Tan is a very [00:01:25] established researcher working in the areas of machine learning and information theory. Without further ado, let's welcome Professor Tan to give the group a talk today. Thank you.
>> Thank you, thank you very much. Can you hear me? >> Yes, it's great. >> OK, so this is work we've [00:01:48] done with Anshoo Tandon and Shiyao Zhu from the National University of Singapore, and the work is about learning tree-structured models in noise. I will describe what I mean by that shortly, but the emphasis of this work is on aspects beyond just observing clean samples from the underlying model: these two aspects are exact asymptotics and robustness, and I'll talk about both in detail in the following. [00:02:21] Let's get straight to the material. This talk is about graphical models, an area of study that can be considered a marriage of probability theory and graph theory. The nodes correspond to random variables — there are three random variables in this picture — and the edges represent the probabilistic dependencies between the variables. For example, in this picture we see that X2 and X3 are conditionally independent given X1, because if I remove X1 from this graph, X2 and X3 become separated and are no longer connected, and the joint distribution factorizes as follows. In general we can talk about larger graphs, and a larger graph is said to satisfy the Markov property if this condition holds. It basically says that X_i conditioned on all the other nodes is equivalent to X_i conditioned only on its neighborhood; in other words, the neighborhood of a particular node separates it from the rest of the graph. Graphical models have extensive applications in a variety of areas such as image processing, combinatorial optimization, and so on. In image processing, the model may encode a particular image, and images have some smoothness properties: [00:03:47] a pixel is typically close in value to its neighboring pixels. What we observe is not the image itself but perhaps a noisy version of it, and based on that noisy version we want to infer the original uncorrupted image subject to some smoothness constraints, which are encoded in the neighborhoods of each pixel.
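(To make the factorization concrete, here is a minimal sketch — my own illustration, not from the slides — of the three-node example: X1 is the separator, and conditioning on it empirically decouples X2 and X3.)

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_three_node_tree(n, p_flip=0.2):
    """Sample from the 3-node tree X2 - X1 - X3, whose joint factorizes as
    p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1)."""
    x1 = rng.integers(0, 2, size=n)                # uniform root
    x2 = x1 ^ (rng.random(n) < p_flip)             # each child disagrees
    x3 = x1 ^ (rng.random(n) < p_flip)             # with X1 w.p. p_flip
    return x1, x2, x3

x1, x2, x3 = sample_three_node_tree(200_000)
given = x1 == 0
print(np.corrcoef(x2[given], x3[given])[0, 1])    # ~0: independent given X1
print(np.corrcoef(x2, x3)[0, 1])                  # clearly positive marginally
```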
[00:04:07] In this talk we'll focus on tree-structured graphical models, which are models in which the probability distribution factorizes according to a tree. When we have a tree we can designate one node as the root. [00:04:33] In particular, if we have p nodes, the graphical model factorizes as follows: the root doesn't have any parent, and each of the remaining nodes has exactly one parent — a unique parent. Here pa(i) denotes the unique parent of node i. This is a particular example of a tree model: here we have five nodes, where D1 and D2 could be diseases and S1 to S4 could be symptoms caused by the diseases. My talk will be structured in two parts. Part one will be on homogeneous tree models, and part two will be on non-homogeneous tree models corrupted by non-identically distributed noise. In part one we use the formalism of exact asymptotics to derive refined bounds on the average probability of error in structure learning. In part two we generalize these ideas to the setting of non-identically distributed noise, and there we will see that exact tree structure recovery is impossible in general; instead we will pursue an alternative: rather than exact structure recovery, we will try to recover trees up to a certain so-called equivalence class, and we call this partial tree recovery. I'll talk about this later on. In the first part of the talk we will discuss exact asymptotics; there is an area of study called strong large deviation theory, and we will see how to use these ideas to refine estimates of the error probability of structure learning. [00:05:55] So in this part of my talk I will focus on binary graphical models, in which each variable can take on only two values — in this part of the talk, 0 or 1. We're going to make a few assumptions to simplify the notation. We assume that the graphical model has what is known as a zero external field, which means that all the marginals are uniform; in particular, because we are living in a binary alphabet, the marginals are half-half for all the nodes. We also assume a condition known as theta-homogeneity, which means that two neighboring nodes agree with probability 1 - theta and disagree with probability theta. In information theory parlance, two neighboring nodes disagree with each other according to a so-called binary symmetric channel. So, as shown in this picture, the disagreement probability is theta, and because of the zero external field each of the two disagreeing configurations has probability theta divided by 2. We call such a graphical model homogeneous because all the crossover probabilities are the same, and each crossover probability is less than half, which together with the zero external field implies positive correlation along the edges. So we are assuming trees, we are assuming homogeneous models, and we are assuming zero external field; this simplifies our life significantly and allows a very precise analysis. This is the class of models that we will be concerned with, and as encoded in the setup here, we are given n i.i.d. samples from an unknown tree distribution that satisfies these two properties.
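(A minimal sketch of this sampling model, in my own notation: `parent[i]` encodes the tree, every marginal is uniform, and each child disagrees with its parent with probability theta.)

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_homogeneous_tree(parent, theta, n):
    """n i.i.d. samples from a zero-external-field, theta-homogeneous binary
    tree model. parent[i] is the parent of node i (parent[0] = -1 for the
    root); nodes are assumed topologically ordered (parent index < child)."""
    p = len(parent)
    x = np.empty((n, p), dtype=np.int64)
    x[:, 0] = rng.integers(0, 2, size=n)            # uniform root marginal
    for i in range(1, p):
        x[:, i] = x[:, parent[i]] ^ (rng.random(n) < theta)
    return x

# e.g. a 5-node tree like the disease/symptom example (labels illustrative)
samples = sample_homogeneous_tree(parent=[-1, 0, 0, 1, 2], theta=0.1, n=2000)
```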
[00:07:44] Given these n samples, we are of course in the business of the classical learning problem: learning the underlying tree structure of the distribution, given the two pieces of side information, and we are interested in controlling the error event that the learned tree structure differs from the original one. We use the maximum likelihood formalism: the maximum likelihood estimate is the one that maximizes the likelihood, and because the samples are independent, the maximum likelihood estimate [00:08:17] is the one that maximizes the sum of the log-likelihoods. In order to develop an algorithm for this objective, we need to talk a little about empirical distributions. The empirical distribution is simply a normalized count of how many times you see each sample configuration; that is exactly what this is. [00:08:39] And the maximum likelihood estimator, it can be shown rather easily, is the one that minimizes the relative entropy between the empirical distribution and a distribution that is Markov on some tree with the given homogeneity parameter. So as you can see here, the empirical distribution does not live in this set, and we are looking for the distribution in the set that is closest to it in terms of KL divergence — a projection onto the set. [00:09:13] This is known in the information theory literature as the reverse information projection (there is also an information projection, in which one minimizes over the first argument). When the crossover probability is known, the structure learning problem is exactly the problem of finding this projected distribution. [00:09:33] So we have talked about the joint empirical distribution; now let us define a quantity known as the agreement. The agreement between nodes i and j is the probability, under the joint empirical distribution, that the two nodes take the same value, as given by this statistic. Now, a result of ours says that [00:10:03] if you want to find the maximum likelihood tree, all you have to do is populate a complete graph with these agreement weights and find the spanning tree of maximum weight. This is a simple derivation, but it lends itself to a lot of interesting analysis later on. So, one more time: we define the agreement quantities, one for each pair of nodes; we populate a complete weighted graph with these weights; and we find the maximum weight spanning tree in order to obtain the best distribution that is Markov on a tree with the given crossover probability theta. In the absence of side information, a similar derivation goes back to the so-called Chow-Liu algorithm from the late sixties: without such information, the tree can be learned via Chow-Liu, where in place of the agreements we have the empirical mutual informations, which are nonlinear functions of the empirical distribution. But if you already know the two pieces of side information — homogeneity and zero external field — you can simplify the mutual information to something more tractable, namely the agreement: the agreement replaces the empirical mutual information, and, as I mentioned, it lends itself to very nice estimates of the error probability.
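(Here is a sketch of this max-weight spanning tree step — a hand-rolled Kruskal over agreement weights; the `samples` array is assumed to come from a sampler like the one above. Note the analysis in the talk assumes random tie-breaking, whereas this sketch breaks ties deterministically.)

```python
import numpy as np

def agreement_weights(x):
    """w[i, j] = empirical probability that nodes i and j agree; x is (n, p)."""
    n, p = x.shape
    w = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            w[i, j] = w[j, i] = np.mean(x[:, i] == x[:, j])
    return w

def max_weight_spanning_tree(w):
    """Kruskal's algorithm: scan candidate edges by decreasing weight and keep
    an edge whenever it joins two different components (union-find)."""
    p = w.shape[0]
    comp = list(range(p))
    def find(u):
        while comp[u] != u:
            comp[u] = comp[comp[u]]        # path compression
            u = comp[u]
        return u
    edges = sorted(((w[i, j], i, j) for i in range(p) for j in range(i + 1, p)),
                   reverse=True)
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            comp[ri] = rj
            tree.append((i, j))
    return tree

print(max_weight_spanning_tree(agreement_weights(samples)))
```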
[00:11:29] And how do we estimate the error probability? There are various ways of doing so, and one of them, studied by myself many years ago, is the error exponent. The error exponent for the maximum likelihood algorithm that we employ is given by this quantity, and the limit can be shown to exist, so it is well defined. The error exponent characterizes the exponential decay rate of the error probability in the number of samples, so the error probability can be estimated as follows. Now, if we have the two pieces of side information, we are able to evaluate the error exponent in closed form, and the only parameter it depends on is the crossover probability. [00:12:17] Our first result says that the error exponent can be evaluated easily and is a simple function of the crossover probability. And what do we know? There are two extreme cases. When all the random variables are perfectly correlated, theta equals 0; in that case the argument of the logarithm is 1, so the error exponent is 0 and learning is impossible. When theta equals one half, a simple thought experiment — plugging one half into this expression — shows that the error exponent is again 0; theta equal to one half corresponds to the case where everything is independent, so learning is impossible. I will show you a graph of the error exponent later on. [00:13:08] Another of our results is that the error exponent, given the two pieces of side information — zero external field and homogeneity — is exactly equal to the error exponent when we do not have these two pieces of information. What this says is that there is no advantage, from an error exponent perspective, in knowing that the tree has zero external field and is homogeneous. However, what we observe empirically is that when the sample size is extremely small, these two pieces of information, and the use of the agreement in place of the mutual information, yield slightly lower error probabilities than the version [00:13:53] in which we do not assume the two pieces of information are available. Now let us compare our result to some bounds developed by other authors recently. Bresler and Karzand, in their Annals of Statistics paper, considered general Ising tree models that allow different correlations on different edges — slightly more general — but if we specialize their result to our setting, in which all the edges have the same correlation parameter theta, then their error exponent is as follows, also a closed-form expression. What we can prove is that our exponent is at least three times as large as theirs — for any distribution, for any set of parameters — and the larger the exponent, the better. This implies that Bresler and Karzand's bound on the error probability is rather loose in this particular setting, and we'll see this later on through the simulations. So that was the error exponent, which is a quantification of the rate of decay of the error probability.
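(In symbols, with T the true tree and \hat{T}_n the maximum likelihood estimate from n samples, the error exponent just described is — in my notation —)

```latex
K_\theta \;=\; \lim_{n\to\infty} -\frac{1}{n}\,\log \mathbb{P}\big(\hat{T}_n \neq T\big),
\qquad\text{so that}\qquad
\mathbb{P}\big(\hat{T}_n \neq T\big) \;=\; e^{-n\,(K_\theta + o(1))}.
```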
[00:15:05] But we are more ambitious: we are interested in the theory of exact asymptotics. These are somewhat complicated definitions that you do not have to internalize; I have highlighted the exponential part in red, and the rest are subexponential terms — constants or small polynomials — that we do not have to care too much about. What we are able to do is evaluate the error probability of learning as a function of these parameters, in the sense that the error probability divided by the quantity in red converges to 1 as n tends to infinity. This is a significantly stronger limiting statement than the error exponent, which is concerned only with the rate. If we pin down all the remaining parts, including the polynomials and constants, carefully, we get a much more precise estimate of the error probability, and here we have to employ the area of study known as strong large deviations. This is a strengthening of the usual large deviations results, and it goes back to Blackwell and Hodges a long time ago. We are not content with just three nodes, as on this slide; we want to go further and consider general trees. [00:16:28] In a general tree model we can also talk about the error probability of learning the tree structure, and we again have a very precise estimate of it. Now we need to run a maximum weight spanning tree algorithm, and one thing I didn't mention on the previous slide is what happens when there are ties: sometimes, when you run the maximum weight spanning tree algorithm, you encounter two possible edges in the next step that have exactly the same weight, and for our result to apply you need to toss a coin. [00:17:03] If you do toss a coin, so that ties are randomly broken, then you obtain an error probability as follows, where this constant here is a function of the structure of the underlying tree model — nothing but a simple function of the degrees of the nodes. [00:17:26] The constant accounts for the number of three-node subtrees of the tree that contribute to the dominant error terms, and the rest of the terms here do not depend on the particular choice of the tree model or its exact structure; they depend only on the parameter theta. Now, as I mentioned at the very beginning, we are also interested in the setting in which we do not observe the true samples x but corrupted, noisy versions of the samples: instead of x we observe y, where the y's are corrupted versions of the x's. Why would we be interested in this setting? Sometimes when we collect observations, those observations themselves may be corrupted. Or you can imagine a large sensor network in which each sensor picks up a particular observation and has to transmit it to a fusion center, in order for the fusion center to do some form of global learning; the transmission may undergo some errors, as indicated by the q's here, so instead of clean samples we have corrupted samples. In this part of the talk, the noise crossover probability is constant across the nodes: what we observe are these y's, where y is the output of each component of x passed through a binary symmetric channel with a fixed crossover probability q.
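(The observation model is just a memoryless binary symmetric channel applied entrywise; a sketch, reusing `samples` from above:)

```python
import numpy as np

rng = np.random.default_rng(2)

def bsc(x, q):
    """Flip each entry of the binary array x independently with probability q."""
    return x ^ (rng.random(x.shape) < q)

noisy_samples = bsc(samples, q=0.1)   # the learner sees only these
```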
[00:18:59] The distribution of the noisy samples is then given by the following, where P here is the distribution of the clean samples, and the clean samples are passed independently through binary symmetric channels with crossover probability q; the Hamming distance here counts the number of flips. So we can extend our analysis of the tree learning problem to noisy samples and provide an explicit characterization of the error exponent and the exact asymptotics. I'm not going to bore you with the long expressions here; instead, let me show you some comparisons to other works. On the left-hand side, we compare our exponent for the noiseless case to Bresler and Karzand's exponent — theirs in red, ours in black, and the bigger the better. What we see is what I mentioned just now: ours is at least three times as large. On the right-hand side, we compare our work to a recent work by Nikolakakis, Kalogerias and Sarwate from about two years ago. The plot shows our exponent in black [00:20:16] when the crossover probability is 0.1, and we compare it to their exponent, in magenta. What is happening here is that our exponent is tight — it is the best possible — and it is significantly better than the 2019 state of the art. OK, so here are some [00:20:42] comparisons of numerical simulations to the theoretical predictions. Let's just look at the line on top, not the one below; the line on top is the noisy case. The prediction is the solid line that looks like a straight line, and our simulations are the red squares and the green crosses; you see there is very keen agreement here. What's the difference between the squares and the green crosses? The squares are obtained [00:21:21] if I use the agreements as my weights for the maximum weight spanning tree; the green crosses use the mutual informations. On this side you see a slight divergence: if I use the side information that my underlying tree model is homogeneous and has zero external field, I get a slightly lower error probability, but as n tends to infinity there is very keen agreement between the theory and both experiments. So [00:21:57] the contribution is that even for very small sample sizes, of the order of hundreds, we get a very simple, single-letter, information-theoretic expression that predicts the error probability of the learning algorithm; we don't need millions and millions of samples. Now let's look at more experiments on general tree models with more than three nodes. In the general case we can figure out what the constant is, and basically we just multiply the exact asymptotics result by this constant, which is a function of the structure of the tree. In particular, for this star graph there are many three-node subtrees that contribute to the error — in fact this many — while for this Markov chain there are not too many three-node subtrees contributing to the dominant error term: exactly p - 2. [00:22:52] It can be shown, as a corollary of our work, that the error probability is maximized for the star and minimized for the Markov chain, but we also consider an intermediate tree structure: half chain and half star.
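(One plausible reading of "a simple function of the degrees": the number of three-node subtrees of a tree is the number of length-two paths, i.e. the sum over nodes of C(deg(v), 2) — this reproduces both counts quoted above, p - 2 for the chain and C(p-1, 2) for the star. A sketch under that assumption:)

```python
from math import comb

def num_three_node_subtrees(adj):
    """Three-node subtrees = paths of length two: one for each unordered
    pair of neighbors of each node."""
    return sum(comb(len(nbrs), 2) for nbrs in adj.values())

p = 10
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < p] for i in range(p)}
star = {0: list(range(1, p)), **{i: [0] for i in range(1, p)}}
print(num_three_node_subtrees(chain))  # p - 2 = 8
print(num_three_node_subtrees(star))   # C(p - 1, 2) = 36
```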
You see from this picture that there is agreement between the simulated results — at least for the star and the hybrid tree — and the theoretical predictions. Again, let's just look at the top line, the noisy case: the theoretical predictions are in black and the simulations are the crosses. There is keen agreement even with small sample sizes; we don't have to go to large sample sizes, and this is because we are using exact asymptotics, or strong large deviations, in place of mere error exponents or asymptotic results. OK, so this part of the talk has been published in the IEEE Journal on Selected Areas in Information Theory. [00:23:56] We used a strong large deviations approach to compute the exact asymptotics for learning trees from noiseless or noisy samples; we obtained refined estimates of the error probability for learning Ising tree graphical models in both the noiseless and noisy cases; we significantly improved on the error exponent and error probability estimates vis-a-vis the state of the art; and our exact asymptotics are in keen agreement with the numerical simulations at relatively small sample sizes. Here I keep p fixed and let n grow to infinity; it turns out that this approach — the strong large deviations approach — is not very amenable to the high-dimensional setting in which p, the number of nodes, grows with n, the number of samples. So in future work we want to tailor the strong large deviations approach, or some other approach, to deal with the high-dimensional case. [00:24:53] I've completed part one of the talk, on homogeneous tree models and identically distributed noise; now let me talk about something more challenging, which is non-identically distributed noise. In my sensor network example, some sensors could be very far away from the fusion center, so as the data is transmitted to the fusion center, the data from some sensors gets corrupted differently from others. The observation setup looks like this picture again. Assume that the random variables we care about are zero mean over a binary alphabet — here it will be more convenient to use the convention of plus/minus one instead of 0/1 — and the joint distribution is the following exponential-family (Ising) distribution with parameters theta. [00:25:48] Here, for a particular tree, it is easy to calculate that the parameters theta_ij are related to the correlations rho_ij by a simple relationship. But now we are looking at a noise model given by these probabilities q_i: the observations are corrupted by independent, non-identical noise. This may be a bit difficult to parse, so here is an illustration. I have [00:26:17] clean variables x, but what I observe are the noisy y's, and the crossover probabilities q1, q2, q3 need not be the same; that's what I mean by non-identically distributed. Now, as I already mentioned, in the case where there is no noise, we have an efficient algorithm for learning a tree — the Chow-Liu algorithm.
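(For a zero-field Ising tree on {-1, +1} variables, the relationship alluded to is the standard identity below; the second identity, for independent sign flips with per-node probabilities q_i, is the calculation behind the noisy-correlation picture — both in my notation.)

```latex
\rho_{ij} \;=\; \mathbb{E}[X_i X_j] \;=\; \tanh(\theta_{ij}),
\qquad
\tilde{\rho}_{ij} \;=\; \mathbb{E}[Y_i Y_j] \;=\; (1 - 2q_i)(1 - 2q_j)\,\rho_{ij}.
```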
[00:26:40] In Chow-Liu, we populate the edges between pairs of nodes in a complete weighted graph with the empirical mutual information values, and we search for the maximum weight spanning tree. Now, this also works if all the noise crossover probabilities are identical — if all the q's are the same — because in that case the ordering of the mutual informations is preserved, so Chow-Liu still succeeds. However, with non-identical noise, the Chow-Liu algorithm may not be able to identify the correct structure, and here is an example. I have a particular tree with these correlations; the true tree is such that x1 and x2 form an edge and x2 and x3 form an edge. But I'm not so fortunate as to observe x2; unfortunately I observe y2, a slightly corrupted version of x2. In this case I can compute the noisy correlation between x1 and y2, which is rho-tilde_12 = 0.72; the noisy correlation rho-tilde_23 between y2 and x3, which we get here; and rho-tilde_13, which is uncorrupted because I observe x1 and x3 cleanly, giving 0.765. So if I observe y2 instead of x2, the tree that I learn has the edges x1-x2 and x1-x3. Here there is a problem: even if I obtain infinitely many samples, I do not learn the correct tree, due to this small corruption of x2 — nothing ensures that we do not learn a wrong tree. [00:28:31] But in recent work, Katiyar and co-authors proposed an algorithm for partial tree structure recovery under non-identical noise at different nodes — an extension of their previous work on Gaussian models. Let me now tell you what I mean by partial structure recovery. This is really about equivalence classes, and the equivalence classes that we can learn are defined by a so-called equivalence relation that I will specify as follows. Here T_p is the set of all trees on p nodes, and L(T) is the set of all leaf nodes of a tree T. [00:29:12] I use this notation to denote a subset S of leaves such that no two leaves in S have the same neighbor, and the construction is as follows: for every such subset of leaves, I consider the permuted version of the tree obtained by interchanging the nodes in S with their neighboring non-leaf nodes in the original tree, and this forms the desired equivalence class. This is a bit complicated, I know, so here is an illustration. This is our original tree model, and these are the members of the equivalence class: basically, leaves are swapped with the nodes adjacent to them. For example, 15 is a leaf, and 14 is a node that is not a leaf but is adjacent to it. [00:30:02] In general I will not be able to learn this tree exactly, as I showed you with the three-node example, but what we will be able to learn is the equivalence class — some tree in the same equivalence class as the true tree. And what is an example of a tree in the equivalence class? Basically I do a swap: focusing on this particular member of the class, I swap 14 and 15. [00:30:29] Similarly for this one: I swap 0 with 2; here I swap 9 with 8; here I swap 1; and so on. These are the various combinations of trees that I could possibly learn.
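(Before moving on, the three-node failure above can be checked numerically. The clean correlations below — rho_12 = 0.9 and rho_23 = 0.85 with q_2 = 0.1 — are my reconstruction, chosen because they reproduce the 0.72 and 0.765 quoted above; correlations multiply along paths of a tree.)

```python
rho12, rho23 = 0.9, 0.85          # clean correlations on chain 1 - 2 - 3 (assumed)
rho13 = rho12 * rho23             # = 0.765, correlations multiply along the path
q2 = 0.1                          # only node 2 is observed through noise
a2 = 1 - 2 * q2                   # attenuation factor at node 2

noisy = {(1, 2): a2 * rho12,      # 0.72
         (2, 3): a2 * rho23,      # 0.68
         (1, 3): rho13}           # 0.765, uncorrupted

# On three nodes, the max-weight spanning tree is just the two largest edges:
print(sorted(noisy, key=noisy.get, reverse=True)[:2])
# [(1, 3), (1, 2)] -- not the true chain {(1, 2), (2, 3)}, even with infinite data
```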
And what is the main result of Katiyar and co-authors? It says that for arbitrary noise — noise that is not necessarily the same across the nodes — the statement, which is a bit long, can be compressed to the following fact: the best we can do is to learn trees up to this equivalence class, the equivalence classes defined by the relation on the previous slide. That's the best we can do, and given that, we want to try to do as well as possible. [00:31:19] The setup is the same as in the previous part of the talk, except that instead of looking at the noiseless samples x, we look at the y samples, which are the noise-corrupted versions I described: [00:31:38] y1 through y3 in this picture. In this setting we define a learning algorithm as a mapping from the n i.i.d. noisy samples to the set of trees on p nodes. The noises at the nodes are independent but not identically distributed, and they are completely unknown to the learning algorithm. We are interested in partial recovery, that is, learning up to the equivalence class, [00:32:08] and an error is declared when the output of the learning algorithm is not a tree in the equivalence class of the true tree; as we argued, that is the best we can do. Katiyar and co-authors presented an algorithm for partial tree recovery assuming that the correlations are uniformly bounded away from 0 and 1 and the non-identical noise crossover probabilities are bounded away from one half. Their main contribution is an algorithm that classifies any set of four nodes into a star or a non-star. So there is something we have to define: [00:32:51] a non-star — we need to define a non-star before we define a star, unfortunately. A non-star is a set of four nodes for which there exists at least one edge that, when removed, splits the tree into two subtrees with two of the four nodes on one side and two on the other. So this is a non-star, because there exists an edge, indicated by this dotted line, such that if I cut it, I have two of the nodes on one side and two on the other. This, however, is not a non-star: no matter which edge I cut — this one or this one — there will always be one of the nodes on one side and three on the other, which does not conform to the definition of an edge with two on each side. And if the set of four nodes is not a non-star, it is declared to be a star. [00:33:55] Katiyar and co-authors proposed a procedure for classifying a set of four nodes into a star or a non-star, and it is shown here. First we compute the empirical correlations; in essence, they look at ratios of products of correlations and compare them to a threshold alpha, which is the arithmetic mean of 1 and rho_max squared, where rho_max is the maximum correlation, assumed known. This is a little complicated, so let us try to dissect it. [00:34:32] Suppose we have four nodes that form a Markov chain, and denote the noisy correlations by rho-tilde_ij. Then it is easy to see that rho-tilde_13 times rho-tilde_24, divided by rho-tilde_12 times rho-tilde_34, is less than or equal to rho_max squared. If, however, the denominator changes to rho-tilde_14 times rho-tilde_23, the ratio is exactly equal to 1. So a very natural algorithm is to place a threshold alpha between rho_max squared and 1.
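(A sketch of this classification test as I understand it — the actual procedure of Katiyar et al. may differ in detail. For a non-star with split {a,b}|{c,d}, the pairing product rho_ab * rho_cd is the largest of the three, and the other two fall below it by a factor of at most rho_max squared; for a star all three coincide. The per-node noise factors (1 - 2q_i) cancel in every ratio, which is what makes the test noise-proof.)

```python
def classify_four(rho, quad, rho_max):
    """Declare a 4-node set a star or a non-star from (noisy) correlations.
    rho: numpy (p, p) array of empirical correlations; quad: 4 node indices."""
    a, b, c, d = quad
    pairings = {((a, b), (c, d)): abs(rho[a, b] * rho[c, d]),
                ((a, c), (b, d)): abs(rho[a, c] * rho[b, d]),
                ((a, d), (b, c)): abs(rho[a, d] * rho[b, c])}
    alpha = (1 + rho_max ** 2) / 2                 # threshold in (rho_max^2, 1)
    split, best = max(pairings.items(), key=lambda kv: kv[1])
    second = sorted(pairings.values())[1]          # middle of the three values
    if second / best < alpha:
        return "non-star", split                   # split edge separates the pairs
    return "star", None
```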
Looking at the empirical correlations, we want the empirical version of the first ratio to be less than alpha and the empirical version of the second ratio to be bigger than alpha; that is the classification scheme. And what Katiyar and co-authors show is that if all sets of four nodes are correctly declared a star or a non-star, then the equivalence class is successfully recovered. When we talk about high probability, we are interested in sample complexity, and they show the following sample complexity result: the equivalence class can be correctly recovered with probability at least 1 - tau if the number of samples satisfies the following. Let us do some sanity checks. As q_max tends to one half, each observation gets corrupted with a lot of noise, and learning becomes more difficult because the nodes suffer from too much noise. [00:36:05] And as rho_min tends to 0, learning also becomes extremely difficult, because the minimum correlation is too small. So this is their result, but you notice that the polynomial orders here are really too large for any practical use. Can we improve on this? We were motivated by this line of work, which is very nice and unconventional, because in most graphical model learning setups you observe the samples uncorrupted, but here they are corrupted, possibly in an arbitrary way: [00:36:44] each node is corrupted differently from the other nodes. Our contributions consist of three parts: we significantly improve the analysis of Katiyar and co-authors; we propose a significantly improved algorithm that we call SGA, for symmetrized geometric averaging; and we prove an impossibility result — a converse showing you can do no better. The first contribution is a significantly improved analysis of their algorithm: we show that under Katiyar et al.'s algorithm, the equivalence class can be correctly recovered with high probability if the number of samples satisfies the following. The main idea is to use tail probability bounds for the events I just described: [00:37:36] this random variable nominally has value less than rho_max squared, so it should fall below alpha, while this random variable nominally has value equal to 1, so it should be well above alpha. However, you notice that the polynomial order is still undesirably large, even though we have improved significantly on their [00:38:04] analysis. So the natural question is: can we do even better? The answer is yes — we can get a better algorithm, empirically and provably. We cannot really improve on these orders; rather, we improve on a different aspect, namely the error exponent. [00:38:26] Our proposed algorithm is SGA, which stands for symmetrized geometric averaging. It is as follows — and it is complicated, as you can see here — but I'm not going to describe all the intricacies; let me instead give the intuition.
The advantage of symmetrized geometric averaging over Katiyar et al.'s procedure is that it takes into account the inherent symmetry under permutations of the node indices, and it considers the geometric mean of the empirical statistics. Katiyar et al. look at only one of these terms and compare it to a threshold; we look at two terms that we can extract from the statistics and take their geometric mean — the geometric mean is basically the square root of the product — and we are looking at these sorts of terms here, as you can see. [00:39:23] So here is the intuition. Take, for example, the case where these four nodes form a non-star, with x1 and x2 forming a pair and x3 and x4 forming a pair. If rho-tilde_ij denotes the noisy correlations, then we have these two relationships, and we expect the following statistics, based on the empirical correlations, to be less than alpha, which is slightly bigger than rho_max squared. The main idea is that instead of checking condition 1 alone, [00:39:57] we check both conditions 1 and 2 via a geometric averaging procedure: we take the geometric average of this quantity and this quantity and compare it to alpha — it is as simple as that. So SGA compares the geometric average of the statistics in conditions 1 and 2 against the threshold. [00:40:20] We make use of a folklore theorem — this is just folklore — that averaging cannot hurt and generally helps: you cannot really lose by averaging more, and most of the time it helps. Now, these two statistics are highly correlated, so we cannot exploit any independence to decouple the error events, and we have to resort to a more intricate analysis, from which an error exponent drops out. If we analyze the SGA algorithm via its error exponent, we can certainly do so — I will spare you the details — and we can get an error exponent expression [00:41:04] for the chain drawn here, but we can do it for all configurations, and I assure you this can be done: it is basically a collection of optimization problems involving divergences. One can also derive error exponents involving the symmetrized geometric averaging quantities, so instead of showing you all these expressions, let me show you a plot of the error exponents. The blue curve is ours, SGA, and the red one is Katiyar et al.'s. What you can see is that most of the time [00:41:42] our exponent is much larger than theirs. There are some small instances in which their exponent is marginally higher than ours, but that is the case without noise, q_max equal to 0. Here the edge correlations are fixed, and we see that for most values of q_max — the crossover probabilities of the noise on the samples — our exponent is the larger one. [00:42:13] OK, so that slide was for the chain, and this one is for the star. For the star we uniformly improve on Katiyar et al.'s error exponent, which we analyzed carefully: the blue curve is always higher than the red one. Now we can also look at larger tree structures — a star and a chain — and here the edge correlation is fixed at 0.6. When you do not have any noise, the Chow-Liu algorithm performs the best: it has zero error probability after 300 samples. Katiyar et al.'s algorithm gives a decaying error probability, which is reasonable, but our algorithm does much, much better. And in the case where there are corruptions — q_max some non-zero value — Chow-Liu fails miserably.
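(The geometric-averaging step, in the same sketch form as the classifier above: for the correct split, both "crossing" ratios should be small, so SGA thresholds their geometric mean rather than a single ratio. This is my distillation of the idea; the published SGA algorithm has more machinery around it. Averaging the two noisy statistics reduces their variance, which is where the improved error exponent comes from.)

```python
import numpy as np

def sga_classify_four(rho, quad, rho_max):
    """Symmetrized-geometric-averaging variant of the star / non-star test."""
    a, b, c, d = quad
    pairings = {((a, b), (c, d)): abs(rho[a, b] * rho[c, d]),
                ((a, c), (b, d)): abs(rho[a, c] * rho[b, d]),
                ((a, d), (b, c)): abs(rho[a, d] * rho[b, c])}
    alpha = (1 + rho_max ** 2) / 2
    split, best = max(pairings.items(), key=lambda kv: kv[1])
    others = [v for k, v in pairings.items() if k != split]
    gm = np.sqrt(others[0] * others[1]) / best   # geometric mean of both ratios
    return ("non-star", split) if gm < alpha else ("star", None)
```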
[00:43:10] SGA, though, is pretty robust: the error probability keeps going down, with a noticeable exponent, and the error probability is much better — orders of magnitude better than Katiyar et al.'s. The same observation holds for the hybrid tree structure. For the star tree structure, [00:43:33] we have a slight improvement over Katiyar et al. — only a slight improvement — and this is also predicted by the error exponent results. So across all tree structures we improve: sometimes a lot, sometimes a little, but always at least a little. We can also extend our analysis to the Gaussian case, in which we obtain a bunch of noisy observations — Gaussians that are corrupted versions of a Gaussian random vector that factorizes according to a tree model. The tree structure is encoded in the inverse covariance matrix, and we add noise to the covariance matrix; here I am adding non-identical noise, because the noise is added only to the odd nodes 1, 3, 5, 7, 9. What we observe is that again, with our more robust SGA algorithm, we get a significantly lower error probability, for both the noiseless and the noisy cases, than with Katiyar et al.'s algorithm. So far we have only talked about algorithms and achievability results — what we can do — but as an information theorist I am also very interested in what we cannot do. So let this set denote the set of all tree-structured Ising models whose correlations are bounded between rho_min and rho_max as follows. We can then define this object, the minimax risk; let me walk you through it. It is the probability of the output of the algorithm not giving us a tree in the equivalence class, [00:45:17] and I am looking at the worst-case error probability over all possible distributions whose correlations are bounded below by rho_min and above by rho_max, with the crossover probabilities of the observations bounded above by q_max, [00:45:37] and I am looking at the best possible estimator. So we are maximizing over all instances and minimizing over all possible algorithms. OK, so how large is this minimax risk? We show that [00:46:09] if the number of nodes is sufficiently large and the number of samples is too small — too small compared to the number of nodes and the parameters of the model — then the minimax risk cannot vanish; it is always bounded away from zero. In other words, the optimal sample complexity, if you unravel all these quantities, is a constant divided by a product of factors involving (1 - rho_max), (1 - 2 q_max) and rho_min. If we compare this to our improved analyses of Katiyar et al.'s algorithm and of SGA, we notice that there is a gap in the exponents, between 2 and 6. We would like to close this gap, but currently it seems difficult to improve these numbers, because we have already brought the exponents down from some very large values, like 24, [00:46:46] and I think it is going to be difficult to bring them all the way down to 2. The main idea of this impossibility result is to convert the learning problem into an M-ary hypothesis testing problem and then use a properly designed Fano inequality: we construct a bunch of tree structures belonging to this class with the prescribed parameters, and the tree structures we construct form [00:47:17] the elements of our hypothesis test.
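(In symbols, the minimax risk just described is — in my notation, with [T_P] the equivalence class of the tree underlying P —)

```latex
\mathcal{R}_n \;=\; \min_{\hat{T}}\;\max_{P \in \mathcal{T}(\rho_{\min},\,\rho_{\max},\,q_{\max})}
\mathbb{P}\big(\hat{T}(Y^n) \notin [T_P]\big).
```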
From this hypothesis test we can apply Fano's inequality in order to get the desired minimax lower bound. I just want to discuss this a little further. Nikolakakis, Kalogerias and Sarwate presented a bound on the number of samples required for exact tree structure recovery under identical noise, and what their bound shows is that the impact of the noise vanishes as the model gets large: for exact structure recovery in graphical model selection, we generally allow the number of nodes to tend to infinity, [00:47:57] and if you look at this expression, with q not equal to 0 or one half, as p — the number of nodes — tends to infinity, this quantity tends to 1, which means the impact of the noise is completely wiped out; it becomes completely negligible. Our result is a little more meaningful, because it shows that the number of samples you need is larger by a factor of the reciprocal of (1 - 2 q_max) squared, [00:48:28] regardless of the number of nodes. So you can actually see the impact of the corruptions on your observations: it manifests itself in this (1 - 2 q_max) factor in the minimax bound. So, to conclude this part: we derived an improved sample complexity result for Katiyar et al.'s algorithm for partial tree recovery; we proposed SGA, which significantly improves the performance of partial tree recovery, as corroborated by our error exponent analysis and numerical results; and we provided a converse, or impossibility, result that is meaningful in the sense that you can see the effect of the corruption under non-identically distributed noise. There are a few directions for future work. Both papers are online — the second one is available on arXiv — and these are my collaborators on both works. Thank you very much.
[00:49:34] >> OK, thank you, Vincent. I suspect it is late at night in Singapore, so thanks for staying up late to present this. I haven't seen any questions in the chat box, so let me ask the audience: any questions? If not, maybe I'll start with a high-level question. [00:50:09] In your work about recovery of tree structures, you essentially have observed variables and hidden variables, and the connection is through the tree structure. So I'm curious, at a high level, how this can be generalized to other graph structures — a general graph, or some other kind of structure. What is the [00:50:33] main difference between a tree and a general graph structure? >> You mean general graphs, not restricted to trees, right? OK. The difference between our work and general graphical model selection — that is, learning a graph structure from samples —
is as follows. There is a very long line of work on graphical model selection, and we are not claiming to solve the general graphical model selection problem. The novelty in this work is that the samples we observe are not actually drawn from the true graphical model; they are corrupted versions of it. To the best of our knowledge, this setting — where the samples are not actually generated from the underlying graph — has not been extensively explored. And the first thing we try, whenever we consider a new variant of graphical model selection, is trees, because those are the simplest models. So in this new framework we are looking at trees: the samples were generated from a tree, but we don't observe them directly; we observe corrupted versions of those samples, and we want to say something about a structure that we don't really have direct access to — even with an infinite number of samples, we are trying to do some form of [00:51:58] robust inference from samples that are not exactly from the underlying model. As a starting point we are looking at trees; this is of course a very good question, and hopefully we can produce more results for general graphs — this is something the members of our group are thinking more about. >> OK. So, for example, the model could be extended in that direction? >> Yes, yes, yes. >> That's very interesting. So can I make another analogy: it's basically like [00:52:30] a hidden Markov model for sequences, except here you have a hidden tree structure, and the observations attached to it are noisy. >> Right. And if it were a latent tree model, these sorts of models can be learned consistently — we looked at that many years ago. [00:52:53] But if the samples are additionally corrupted by noise, a lot of interesting things can happen; I guess a lot of interesting things can happen in the latent setting as well. >> Yes, very interesting. I think this work could be important for causal inference in real life, because you often have multiple factors, you are interested in identifying the dependency structures, and a tree is one such structure. With this kind of robust inference you could also tell, based on the inferred structure, how these factors are connected, [00:53:31] so this could be very significant in that regard as well. All right, let's see, are there any more questions from the audience? If not, given that we are short on time, we can take the discussions offline; feel free to send me an email if you have further questions, and I can make a connection with Professor Tan. [00:53:55] All right, so once again, thanks a lot, Vincent, for giving the talk today — very interesting presentation. >> OK, great. >> So thanks a lot, everybody, and I hope to see you at the next seminar.