Hello everybody. This talk is about learning Markov random fields, also called graphical models, and it is joint work with Adam Klivans from UT Austin. The talk is mainly about an unsupervised learning problem where, at the highest level, what you're trying to do is the following: you see samples from some distribution, let's say a distribution on the hypercube, and based on the samples you would like to learn some properties of the distribution. Today we are going to be interested in one specific and important property, which is to figure out a dependency graph for the distribution.

So what's a dependency graph of a distribution? At a high level, a graph G is a dependency graph of the distribution if, looking at the random variables X_1 through X_n, whenever there is no edge between two vertices i and j in the graph, the corresponding random variables X_i and X_j are independent conditioned on the neighbors of, say, the vertex i. So in some sense, since there is no edge between i and j, if you condition on the neighbors of i, the variable X_i becomes independent of X_j. At this level of generality, just asking for a dependency graph is not a well-posed question; you can come up with degenerate answers and so on. But there is a meaningful way to formalize it, and for that I first need to mention the Hammersley-Clifford theorem, which, under some non-degeneracy conditions, says the following: G is the dependency graph of a distribution if and only if the probability density function is proportional to the exponential of a sum of local functions of the variables, where the local functions are supported only on the cliques of the graph. That is, you look at the cliques of the graph G, and for each clique you are allowed a function which depends only on the variables in that clique.

One very important example is the Ising model, where the probability density function is exactly proportional to the exponential of a sum over the edges of the graph of terms with weights w_ij times x_i x_j, plus some mean-field terms, that is, linear terms. More generally, you have a t-MRF, a t-wise Markov random field, if in this Hammersley-Clifford decomposition you only allow cliques of size at most t. So t equal to two corresponds to Ising models, and larger t to more general Markov random fields. These are also known as graphical models in machine learning, and some people call Ising models Boltzmann machines and things like that.

So here's the question. I give you samples from a t-MRF; you should think of t as small, since t equal to two, the Ising model, is already very interesting, and then three, four, and so on. Our goal is to figure out the edges of the graph: given the samples, pin down where the edges are. This is known as the structure learning problem in machine learning. There is also the corresponding question of parameter learning, where you also try to figure out the weights; many of our results extend to recovering the weights and the clique functions, but I'll only talk about structure learning today, and the parameter guarantees are necessarily a bit weaker.
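For reference, here is the density being described, the Ising model and its t-MRF generalization, in standard notation (the usual convention; the talk's slides may differ in signs or normalization):

```latex
% Ising model on \{-1,1\}^n with edge weights w_{ij} and mean-field (linear) terms w_i:
\Pr[X = x] \;\propto\; \exp\Big(\sum_{i<j} w_{ij}\, x_i x_j \;+\; \sum_i w_i\, x_i\Big)

% t-MRF (Hammersley--Clifford form): only cliques of size at most t contribute
\Pr[X = x] \;\propto\; \exp\Big(\sum_{C \text{ clique of } G,\ |C| \le t} \psi_C(x_C)\Big)
```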
This problem has many applications in machine learning; it is used a lot in natural language processing, in figuring out protein-protein interactions, and things like that. There is a lot of prior work on this problem, which I'll get to when I state the results.

The running example for this talk is going to be the simplest case, the Ising model. In full generality the problem cannot be solved, so you have to place some constraints on the graph. We'll think of the graph as having bounded degree, where the degree may be a constant, ten, log n, something like that, and of the weights as numbers bounded between some constants: they cannot be too small or too big. The comparison point for us is the brute-force algorithm, which runs in time about n to the d: you just enumerate all possible neighborhoods of each vertex and check which one is consistent. Can you do better? In fact you can, and there are works which did this before us, but let me tell you our algorithm, because it also generalizes later on.

[In response to a question:] Right, one way to run the brute force is, for each vertex and each possible choice of d neighbors, to estimate the conditional distribution and check whether it explains the data; you can forget about the graph as a whole and recover it one vertex at a time. All right, let me tell you our way to do it.

To describe the algorithm, which is quite simple as you'll see, all I need is the sigmoid function, σ(z) = 1/(1 + e^{-z}) on the real numbers, which has this nice S shape. It comes up naturally, as we'll see later on. So here's the algorithm. I'll tell you how to figure out the neighborhood of a single vertex; when you actually implement it you do this for all the vertices, but let me focus on one vertex. What are we trying to do? We see a sample from our distribution, call it X_1 through X_n. The way the algorithm works, it helps to do the following thought experiment: imagine the i-th variable X_i is hidden, and I try to guess what X_i is based on the other variables; you can think of each other variable X_j as a guess for X_i. If a guess is correct, meaning it has some correlation with X_i (correlation is a loaded word here, but we will make it precise), then it gets rewarded, and otherwise it gets penalized. We are playing a prediction game in some sense, and this game will also be useful in the analysis. How do we reward and penalize the guesses? We use multiplicative weights, which is a classical algorithm from learning theory. To describe the actual algorithm I need one more definition, which I'll state for the Ising model but which also extends to MRFs: the parameter λ, which you should think of as the ℓ1 norm of the coefficients at a single vertex. Each vertex has a bunch of coefficients, and λ is the maximum over vertices of the ℓ1 norm of the coefficients at that vertex. For instance, for the running example I mentioned, where the graph has degree d and the coefficients are all constants, λ is O(d).
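Concretely, the width parameter λ just defined would be, for the Ising model written above (whether the linear term is counted is a convention; I include it here):

```latex
\lambda \;=\; \max_i \Big(\sum_{j \ne i} |w_{ij}| \;+\; |w_i|\Big)
% the largest \ell_1 norm of the coefficients touching any single vertex;
% for a degree-d graph with \Theta(1) weights, \lambda = O(d).
```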
All right, so we are going to play this game. We start with one more small thought experiment: we break up each vertex j into two copies, j-plus and j-minus; this is just for simplicity at this point, and we think of j-plus as betting on positive correlation with i and j-minus on negative correlation. So we maintain weights w-plus and w-minus. Initially we don't know anything, so we just initialize them uniformly, and our actual guess for the weight of the edge will be the difference of the two. Now we see a sample X_1 through X_n, and we use the other variables to venture a guess for what X_i should be: imagine X_i is hidden, every other vertex X_j gets its own weight, and the overall prediction for X_i is the weighted sum. Once we make this prediction we check how good it is, and here is where some work, and the structure of the MRF, shows up. This formula will become clear later on, but for now let's take it for granted: each vertex j incurs a loss, a penalty for how good its guess was, given by this number; the sigmoid appears there, and the loss depends on the prediction and on the actual value of X_i. It seems a little mysterious, but if you dig into it, it is just a conditional expectation in disguise.

[Answering a question:] Yes, basically if you condition on all the variables except X_i, then the probability that X_i equals one is given by the sigmoid of a linear form, and we'll see why that sigmoid appears.

OK, and once you have these losses you update the weights using the multiplicative-weights framework: the weight of j-plus gets multiplied by some factor depending on its loss ℓ_j, and the same for j-minus. Finally, as I said, our guess for the actual edge weight is the difference between the two copies, except you rescale so that the ℓ1 norm is at most λ. Don't worry about this normalization; it is the same for every vertex, and the main thing is that you look at the difference between the positive and the negative copy. Recall λ was the ℓ1 norm of the coefficient vector at a vertex, so the idea is that the vector we are trying to recover has ℓ1 norm at most λ, and we maintain that throughout the course of the algorithm. And that's it; this is the whole algorithm.

[Answering a question:] Sorry, that's a typo on the slide, it should be an i there; this is just the normalization. Thanks.

If you know stochastic gradient descent, this is somewhat like stochastic gradient descent with a different loss function, but not exactly; plain stochastic gradient descent doesn't quite work here, and when you fix it you arrive at this. What we show is that this algorithm actually converges to the true weights at a nearly optimal rate: after 2^{O(λ)} · log n / ε^4 samples, the weights that you have are ε-close to the true weights, for every vertex.

[Answering a question:] Here we are using the assumption that the true coefficient vector has ℓ1 norm at most λ. The w's initially sum to one; if λ were actually one you could skip the rescaling, but in the bounded-degree case λ can be larger than one, so we rescale.
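Here is a rough Python sketch of the per-vertex routine just described. The loss formula and the Hedge-style multiplicative update follow the description in the talk; the learning rate beta and the output rule are my own simplifications (the actual algorithm keeps every intermediate weight vector and outputs the one with the smallest empirical squared loss on held-out samples):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def learn_neighborhood(samples, i, lam, beta=0.9):
    """Multiplicative-weights estimate of the edge weights incident to vertex i,
    from samples in {-1,+1}^n.  lam is the assumed bound on the l1 norm of the
    true coefficient vector at i; beta is a tunable Hedge parameter."""
    X = np.asarray(samples, dtype=float)
    n = X.shape[1]
    # Split every candidate neighbor j into a "+ correlation" and a "- correlation" copy.
    w_plus = np.full(n, 1.0 / (2 * n))
    w_minus = np.full(n, 1.0 / (2 * n))

    for x in X:
        y = (x[i] + 1.0) / 2.0                       # the hidden label X_i, mapped to {0,1}
        x_rest = x.copy()
        x_rest[i] = 0.0                              # X_i itself never votes
        total = w_plus.sum() + w_minus.sum()
        w = lam * (w_plus - w_minus) / total         # current signed guess, l1 norm <= lam
        p = sigmoid(np.dot(w, x_rest))               # predicted Pr[X_i = 1 | rest]
        # Per-expert losses in [0,1]; the (1 + ...)/2 shift keeps them in range.
        loss_plus = (1.0 + (p - y) * x_rest) / 2.0
        loss_minus = (1.0 + (p - y) * (-x_rest)) / 2.0
        # Multiplicative (Hedge) update: penalize each copy by its loss.
        w_plus *= beta ** loss_plus
        w_minus *= beta ** loss_minus

    total = w_plus.sum() + w_minus.sum()
    return lam * (w_plus - w_minus) / total          # estimated weights w_{ij} for all j
```

Running this for every vertex i and declaring an edge wherever the estimated weight exceeds, say, ε/2 in magnitude is the structure-learning recipe the talk describes.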
So roughly log n samples, if you think of ε as a constant and λ as a small number, and you converge to the true set of weights. As a corollary, if the minimum coefficient magnitude in your Ising model is at least ε and the maximum ℓ1 norm is at most λ, you can recover the graph with 2^{O(λ)} · log n / ε^4 samples. For instance, going back to the running example, you can recover the graph with 2^{O(d)} · log n samples. The important thing here is that the dependence on the dimension is logarithmic: you are learning a graph on n vertices using only about log n samples, especially when the degree is small. This exponential dependence was already known to be nearly necessary from the work of Santhanam and Wainwright: there is a lower bound of order 2^{Ω(λ)} · log n / ε on the sample complexity, so the exponential dependence on the ℓ1 norm is necessary, and the only places where we are off are the constant in the exponent and the dependence on ε. Information-theoretically you cannot distinguish the models with fewer samples.

[Answering questions:] Here λ is basically O(d), because the coefficients are each at most some constant. Why the exponential? That is the information-theoretic lower bound: even for bounded degree there is such a bound. If the degree is d, our algorithm uses 2^{O(d)} · log n samples and there is a corresponding lower bound of 2^{Ω(d)} · log n. Yes, that is the example you should keep in mind, although I'll mention one caveat later on.

Another interesting thing is that the running time of our algorithm is quadratic: the work per sample is basically quadratic, and you process about log n samples, so it is Õ(n²) in the case when the graph is bounded degree and the weights are constants. This might seem improvable, but there is a marvelous connection due to Bresler, Mossel, and Sly from 2013, who related this problem to learning sparse parities with noise more generally. In particular, an implication of this connection is that even when the graph is a single edge with constant weight, the best algorithm we know for identifying that single edge runs in time n^1.62, due to Greg Valiant in 2011, and it is actually related to an old problem of Leslie Valiant, the light bulb problem. So one very interesting question is: can you get n^1.62 for the general case of graphs instead of just a single edge?

[Answering questions:] For a single edge, identifying it, the best known is n^1.62, and we can do all graphs in n². I mean bounded-degree graphs, yes. No, no, I said the best algorithm known is n^1.62. This is something Leslie Valiant posed back in the seventies, and the first improvement over n² came only around 2011, with Greg Valiant's work. Maybe you have to wait for the next generation of Valiants.

Let me say a few words about prior work. There are roughly two different streams. One is to make assumptions on the distribution of the Ising model that you get: for instance, if the model has correlation decay, then the work of Bresler, Mossel, and Sly shows that you can actually learn the graph with better parameters and polynomial running time, and so on.
Another line of work is to assume that the weights satisfy something called incoherence; I am not going to define it, this is from the machine learning community, and they also get results, but again these are assumptions on the structure of the distribution and the graph. The most relevant work for us is a really nice paper of Bresler from 2015, which showed that, without assumptions on the underlying distribution, if the graph has bounded degree then there is an algorithm which learns the graph with a number of samples that is doubly exponential in the degree times log n. Once again the dependence on the dimension is logarithmic, which was the main takeaway from his work. That was only for Ising models. To be fair about results for Ising models: a few years ago there was a paper which improved this doubly exponential dependence to singly exponential, with running time about n^4, but they also had a zero-mean-field assumption; they couldn't handle the linear terms. And none of these works addresses the case where you have more general Markov random fields: what if you have a 3-MRF, what if you have a 4-MRF? To the best of our knowledge we couldn't find any rigorous result which does not place assumptions on the underlying distribution. And once again, the connection of Bresler, Mossel, and Sly shows that learning t-MRFs is at least as hard as a well-studied problem in learning theory known as learning sparse parities with noise.

So here is our result. If you have a t-MRF, you can define an analogue of the parameter λ: earlier we looked at the ℓ1 norm of the coefficients at a single vertex; now you look at the ℓ1 norm of all the coefficients that involve a single vertex in the corresponding degree-t polynomial. I won't define it precisely, it is not important for the ideas behind the work; it is really an analogue of λ. You also need some condition on the identifiability of the graph of the Markov random field, since different graphs could give rise to the same MRF, and if different graphs give you the same MRF you cannot hope to learn the graph. So we place an ε-identifiability assumption which is analogous to saying that the non-zero coefficients are at least ε in magnitude, as it was for Ising models. In this case we can learn the graph with roughly the same parameters, 2^{O(λ t)} · log n / ε^4 samples, so the dependence on the dimension is again logarithmic if you think of λ as a constant.

[Answering a question about the right dependence on ε:] Actually I am not sure; maybe ε squared. It could be ε squared, but I don't know; it is either ε squared or something in between, and it is probably not ε to the four. I think ε squared is the answer, and there is a fundamental reason why our approach doesn't give it; in fact it is one of the open problems I'll mention, and I can point to exactly where the difficulty is.

So for instance, if your graph has bounded degree d, then λ is some polynomial in d and you get singly exponential in that, times log n, samples. The sample complexity once again is almost optimal by the results of Santhanam and Wainwright, and the running time of our algorithm in this case is dominated by n^t, so it increases with the order of the MRF. And n^t seems like a lot, but
if you use the reduction of Bresler, Mossel, and Sly, beating n^t here would imply breaking a long-standing problem in learning theory, learning sparse parities with noise: there is a conjecture, or at least many people believe, that you cannot do better than n to the order of t for that problem. So this running time also seems optimal, assuming learning sparse parities with noise is hard. In short, the sample complexity is almost optimal and the running time is optimal, but under an assumption.

In independent work, Hamilton, Koehler, and Moitra extended Bresler's information-theoretic framework to t-MRFs, so they also handle bounded-degree graphs, with doubly exponential dependence on the degree and a logarithmic number of samples. But their result actually works over distributions which are not just on the hypercube; you can have more states, more options for each variable. One thing I didn't mention when talking about Ising models is that our result also works for graphs which are not sparse, as long as the weights are small; all of the previous results need the graph to actually be sparse, that the degree be d, and not just that the weights be bounded. So for instance you can learn a star graph with many small weights, as long as their ℓ1 sum is bounded.

OK, so those are the results. I told you the algorithm and I told you the theorems; now let me tell you the connection between them and how the analysis goes. The main lever we use comes from a fact about Markov random fields, and again I'll specialize to the Ising model because it gives the main ideas anyway. Recall the Ising model has this PDF, and one easy fact which follows from the expression is that if you fix a vertex, the probability that X_i equals one, conditioned on everything else, has a very nice form: it is the sigmoid of an affine function evaluated on the remaining variables, and the coefficients of that affine function correspond exactly to the weights on the edges at vertex i. So that's where the sigmoid in the algorithm comes from. Using this fact we abstract the problem: recall we started with an unsupervised learning problem, with just samples, from which we had to learn the graph; we recast it as a supervised learning problem in the following sense. Let Y be X_i, and let X-bar be all the variables except X_i. I see samples of the form (X-bar, Y), all in the hypercube, and I think of Y as a label, a classification of X-bar, with the condition that the conditional distribution of Y has this form: the probability that Y equals one given X-bar is the sigmoid of a linear function of X-bar. From such samples, can I learn the unknown coefficient vector w?

[Answering a question:] To learn w exactly you would need more samples, because you run into degenerate cases where a coefficient can hide; it can be very small, like one over n, and you get some annoying issues. So this is the supervised learning problem we are going to study. What we actually do is the following: given samples (X-bar, Y) of this form, we try to find another linear function u which is close to the unknown one in the sense of squared loss, meaning the sigmoid evaluated at u's linear function is close to the sigmoid evaluated at the true function, up to some δ.
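Written out, the fact and the resulting supervised problem are the following (the factor of 2 in the first line comes from the ±1 convention for the variables):

```latex
% Conditional law of one coordinate of the Ising model:
\Pr\big[X_i = 1 \,\big|\, X_{-i} = x\big]
  \;=\; \sigma\Big(2\big(\textstyle\sum_{j \ne i} w_{ij}\, x_j + w_i\big)\Big),
\qquad \sigma(z) = \tfrac{1}{1 + e^{-z}}

% Supervised reformulation: given samples (X, Y) with \Pr[Y = 1 \mid X] = \sigma(w \cdot X),
% find u with small squared loss
\mathbb{E}_{X}\Big[\big(\sigma(u \cdot X) - \sigma(w \cdot X)\big)^2\Big] \;\le\; \delta
```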
So let's say we solve this. Is it enough? It's not clear, right: even if the squared loss is small, it does not necessarily mean that u and w are close, or that u gives you any information about the true coefficient vector w. Here is where the properties of the distribution show up. In the second, structural, part of the paper we show that if X is drawn from the Ising model, and the squared loss of u with respect to w is small, then u and w actually are close; they are close in ℓ∞ distance, and here is where the width parameter shows up: the distance is at most 2^{O(λ)} times the square root of the error you had in the squared loss. So there are two steps in our analysis: first, solve the problem of finding a u that nearly minimizes the squared loss; second, a structural statement that says that if X is coming from an Ising model, then small squared loss means the coefficients are also close, which in turn implies that you have learned the graph.

How does this two-step approach carry over to general Markov random fields? If you work through the same calculation as for the Ising model, where the conditional distribution involved a linear function, for a t-MRF the conditional distribution becomes the sigmoid of a degree-(t-1) polynomial in the remaining variables. So we try to solve the same squared-loss minimization question: find a polynomial q such that the expectation of (σ(q(X)) − σ(p(X)))² is small. Then there is again a structural part: if X is drawn from the MRF and the squared loss is small, then, well, it is no longer true that p and q are close coefficient by coefficient, but we can recover the support of p from q; that much we can do. This second part is actually the most involved part of the paper, because it has an algorithm inside it: it is not just reading off the coefficients of q, we have to combine coefficients together to figure out the support, and so on.

In the remainder of the talk I'll mainly focus on the first step: how do I minimize the squared loss, given these samples (X-bar, Y)? Let's go back to the Ising model, where the probability of Y given X is the sigmoid of an unknown linear function, and we are trying to find a function u with small squared loss. The first thing to note is that this is a non-convex optimization problem. Even so, there is a nice result of Kalai and Sastry from 2009, who gave an algorithm which they called the Isotron, because it is a generalization of the perceptron algorithm, and it solves this problem in polynomial time: they show you can find a u satisfying this guarantee, but the sample complexity was polynomial in n and 1/δ, roughly n/δ². This is already pretty good, in the sense that you get polynomial sample complexity, but remember we are after logarithmic sample complexity: we cannot afford n samples, we only have about log n samples. The one thing the Isotron does not exploit is that the unknown function is sparse, in the sense that its ℓ1 norm is at most λ.
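A rough statement of that structural step, as I understand it from the talk (constants suppressed; the precise form is in the paper):

```latex
% Structural lemma (Ising case, paraphrased): if X is drawn from an Ising model of width \lambda and
\mathbb{E}_{X}\Big[\big(\sigma(u \cdot X) - \sigma(w \cdot X)\big)^2\Big] \;\le\; \delta ,
% then the coefficient vectors themselves are close in every coordinate:
\|u - w\|_{\infty} \;\le\; 2^{O(\lambda)} \sqrt{\delta}
% So one needs \delta \approx \varepsilon^2 \, 2^{-O(\lambda)} to pin every edge weight down to \pm\varepsilon,
% and since the learning step pays about 1/\delta^2 samples, this is where the \varepsilon^{-4} comes from.
```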
That is exactly what we do. The algorithm I described at the beginning of the talk, which we call the Sparsitron, is a kind of sparse analogue of the Isotron (there are a lot of '-trons' around: the perceptron, the Isotron, and now this one), and it finds a u satisfying the squared-loss guarantee where the number of samples is just the square of the ℓ1 norm of w, times log n, divided by δ². And this is for any distribution on X; there is no distributional assumption. We started out trying to solve this where X was coming from an Ising model, but this learning part works for arbitrary distributions.

Going back to the earlier question about improving the dependence on ε: this 1/δ² here seems tight, and that is why it seems difficult to use this approach to beat the ε^4.

[Answering a question:] w is some unknown vector, you have some distribution on X, and the condition is that you see samples of the form (X, Y), where X comes from an arbitrary distribution and the probability of Y given X is the sigmoid of w·X. That's it; X is unconstrained, there is no constraint on the distribution of X for this part. This kind of model, where the conditional law of Y given X comes from some class, is known as the probabilistic concepts model in learning theory; it was introduced by Kearns and Schapire back in the nineties. So the problem is learning sigmoids in the probabilistic concepts model of Kearns and Schapire.

All right, so how do we do this? I already told you the algorithm, so let me tell you how to analyze it, and for that I need to briefly review the learning-with-experts framework from classical learning theory. What is the setup? You have n experts; think of each expert as trying to predict the outcome of some experiment, say whether stock prices are going up, or whether it is going to rain today, things like that. You play the following game with the n experts: on each round, each day, the experts make predictions, which for us are just values in {-1, 1}. For instance on the first day, t equal to one, the experts predict x_{1,1}, x_{1,2}, and so on up to x_{1,n}. Once the predictions are made, the experts incur losses. The losses are arbitrary, chosen by an adversary, and are just numbers between zero and one, so the first expert gets loss ℓ_{1,1}, the second ℓ_{1,2}, and so on; you keep playing this game, and on each round the same thing happens. What are we trying to do? We sit on top of the experts: the goal is to maintain a distribution on the experts, based only on past information, which does as well as the best expert in hindsight. You are trying to minimize the regret, which is described by this formula: maintaining a distribution p_t on the experts at each time step, you want to minimize the expected loss you incur under that distribution minus the loss of the best expert in hindsight. This is the classical regret-minimization problem.
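For concreteness, a minimal textbook-style Hedge routine for this game looks like the following (a generic sketch, not code from the paper):

```python
import numpy as np

def hedge(loss_stream, n_experts, beta=0.9):
    """Generic Hedge / multiplicative-weights over n_experts.

    loss_stream -- iterable of loss vectors in [0,1]^n_experts, one per round
    beta        -- update parameter in (0,1); tuned properly it gives regret ~ sqrt(T log n)
    Yields the distribution over experts used at each round.
    """
    p = np.full(n_experts, 1.0 / n_experts)    # start from the uniform distribution
    for losses in loss_stream:
        yield p
        p = p * beta ** np.asarray(losses)     # penalize each expert by its observed loss
        p = p / p.sum()                        # renormalize to a distribution
```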
Now, how does this connect to the problem of learning sigmoids that I talked about? We force our problem into this framework. Recall the question: we have samples (X, Y), where the probability of Y given X has the sigmoid form σ(w·X). For this part, ignoring the constant term, let's also assume the unknown vector w is non-negative and has ℓ1 norm exactly one; if not, you can do some padding and rescaling to arrange it. Initially you don't know anything, so you give weight 1/n to every expert, one for each coordinate. Now you see an example (x_1, y_1). Remember the split into plus and minus copies; that is exactly what makes the weights non-negative: you have two coordinates per variable, where one sees x_j and the other sees minus x_j.

So you see an example (x_1, y_1); how do I update my current weight vector w_1? To fit this into the learning-with-experts game, we think of the coordinates of X as the predictions of the experts: each coordinate is an expert, and x_{1,i} is the prediction of expert i. Then we have this formula for the loss incurred by each expert: it is one half of one plus the prediction error times that coordinate, which is the exact expression I showed in the first part of the talk for the Ising model; the one half and the shift are just there to make the number lie in [0,1], so that we can use the classical learning-with-experts algorithms and their analyses without changing anything. And that's it: you have this weight vector, you run your learning-with-experts update, then you see the next sample and the new losses, and so on. You get a bunch of weight vectors w_1, w_2, up to w_T, and at the end you output the one with the least empirical squared loss. That is how you can use any learning-with-experts algorithm to solve this problem; in particular, if you use Hedge you get exactly the algorithm I described before.

So how do you analyze it? The first step, which follows from the properties of the sigmoid, is that the squared loss of the coefficient vector w_t, this expectation, is at most the inner product of w_t with the loss vector ℓ_t minus the inner product of the true vector w with ℓ_t. That connects to the regret naturally: if you sum over t, the sum of the squared losses is at most the total expected loss you incur in the game minus the inner product of w, the unknown vector, with the sum of the losses. But remember w is a non-negative vector with ℓ1 norm one, so this inner product is at least the minimum total loss of any single expert, which is exactly what appears in the regret. So we have just shown that the sum of the squared losses is at most the regret. Now you can plug in your favorite learning-with-experts algorithm; for instance we use the Hedge algorithm of Freund and Schapire from 1997, which is essentially the multiplicative-weights update I mentioned at the beginning, and it gives regret about the square root of T log n after T rounds. If you take the best w_t, its squared loss is at most the average, which is about the square root of (log n)/T, and that becomes δ once T is about (log n)/δ², as I described. So that is the whole analysis of the algorithm, except for that first inequality, which follows from the fact that the sigmoid is a monotone, Lipschitz function; in fact the algorithm works for any monotone Lipschitz transfer function, not just the sigmoid.
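Putting the steps just described together, with constants suppressed (my compression of the argument):

```latex
% Step 1 (sigmoid is monotone and Lipschitz): per-round squared loss is controlled by a loss difference
\mathbb{E}\big[(\sigma(w_t \cdot X) - \sigma(w \cdot X))^2\big]
  \;\lesssim\; \langle w_t, \ell_t\rangle - \langle w, \ell_t\rangle

% Step 2 (sum over rounds; w is a distribution, so \langle w, \sum_t \ell_t\rangle \ge \min_i \sum_t \ell_{t,i}):
\sum_{t=1}^{T} \mathbb{E}\big[(\sigma(w_t \cdot X) - \sigma(w \cdot X))^2\big]
  \;\lesssim\; \mathrm{Regret}(T) \;\lesssim\; \sqrt{T \log n}
  \quad\text{(Hedge, Freund--Schapire)}

% Step 3: the best of the T iterates therefore has squared loss at most the average,
\min_{t \le T}\; \mathbb{E}\big[(\sigma(w_t \cdot X) - \sigma(w \cdot X))^2\big]
  \;\lesssim\; \sqrt{\tfrac{\log n}{T}} \;=\; \delta
  \quad\text{for } T \approx \tfrac{\log n}{\delta^2}
```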
[Answering a question about how that first inequality is proved:] Basically, if you look at the conditional expectation of Y, it simplifies very nicely. Let me say one thing: if you look at the sigmoid at two real values a and b, then (σ(a) − σ(b))² ≤ (a − b)(σ(a) − σ(b)); that is the inequality you need. So somehow the squared loss, even though we use it at the end, sort of disappears when you do this: you get σ(a) minus σ(b) times the difference of the two arguments, and out comes the expression. I don't have any deeper insight; it is about three lines of calculation. So that is the analysis of the algorithm.

[Answering a question:] I believe this part, the sample complexity of log n over δ², is tight; I don't have a proof, but I think it is tight, because we are not exploiting the distribution of X in this part. Somehow you would need to exploit that. For arbitrary distributions, if you asked me to guess, I would say the 1/δ² is necessary.

OK, so in summary: I told you how to learn an Ising model with 2^{O(λ)} · log n / ε^4 samples. This generalizes to learning Markov random fields, giving the first provable guarantees of this kind for t greater than or equal to three. The sample complexity in both cases is almost optimal, except for the dependence on ε and the constants in the exponent, and the running time is optimal assuming the hardness of learning sparse parities with noise.

There are several nice open questions here. The first one: there is a lot of work in machine learning on Gaussian graphical models, where instead of the distribution being on the hypercube you have a Gaussian distribution with some covariance matrix. But all the known results for learning sparse Gaussian graphical models, even though they state good-looking bounds, still depend on the condition number, so there is a big gap between the information-theoretic results and the computational results in terms of the dependence on the condition number. And for the learning problem: I told you how to learn a sigmoid of a linear function. What if Y is a sum of two sigmoids, σ(u·x) plus σ(v·x)? Can you do anything there? If you can, it probably has very strong implications in learning theory: in particular you could learn Ising models with hidden variables, restricted Boltzmann machines, and there are connections to learning deeper neural networks. So this seems to be the first bottleneck in trying to do those. Thank you.

[Q&A:] Yes, I mean, we didn't do it for larger alphabets; for the hypercube, for t equal to two, we did. For larger alphabets I think it works, but the structural lemma I mentioned is very technical; I believe it should work, but we weren't able to do it, especially for larger alphabets, which the other paper does handle. Sorry, I'm getting the cases confused here: you are asking about t equal to two with larger alphabets? That we can already do; you can ask about bigger t as well, which should be possible I think, but we weren't able to do it.

[Q&A:] You mean for the sigmoid problem or for the graphical model? For the graphical model that is known; there is an information-theoretic lower bound saying you need at least 2^{Ω(λ)} samples. You mean the running time or the sample complexity? It seems necessary because... oh, I see what you are asking. The running time is not exponential in n;
it makes sense: the running time is 2^{O(λ)} times a polynomial in n, not exponential in n. And that seems tight, because you need about that many samples, and also because of the connection to learning sparse parities with noise.