My undergraduate degree and my Ph.D. are in physics, and my worst grade in physics (I won't tell you what it was) came in the experimental physics class. Experimental physics is a very different thing from theoretical physics, and, to oversimplify, I didn't do so well in that class because I basically failed the final. Roughly, the reason I failed the final is that there is a machine, there is a socket in the wall that powers that machine, and you are supposed to put the two together, and I didn't get that. So I tried to argue; I tried to say, well, I had everything else right. And the teacher said, basically: if you learn nothing else, you should know to check the input and check the output. So it's good to know that I haven't lost that skill yet.

OK, so I think the microphones are working. I'm leading a TRIPODS effort, which is foundations of data. Do you need to actually worry about what you plug into the wall? You do need to compute stuff; but for foundations, I guess the question is: should you be totally ignorant about that, or should you be doing the foundations in the context of applications and implementations and so on, so that it might be useful? That's one of a couple of ways to think about what I'll be talking about here today, and it should be sort of in the back of your mind.

So: second order machine learning. I still want some feedback on this. I'm not a marketing guy, I do math and foundations. Is this good marketing? Should I change the name? Does this sound like second class machine learning, you're not so good; or does this sound like higher order and super fancy machine learning? Give me feedback after the talk. The way to think of this is a Taylor expansion: there is a zeroth order, a first order, and a second order, and not just in an informal way; we're going to be doing fancier and more sophisticated types of machine learning that boil down to those sorts of optimization methods. And this, I think, taps right into the middle of the TRIPODS effort you heard about in the NSF call. Georgia Tech, we, and a bunch of other places are setting up institutes to address foundations of data, and the idea is that you want to take some stuff from computer science, some stuff from statistics, and some stuff from applied math, and distill out the parts that are most useful for data. That's actually a little bit tricky, because a lot of people in foundations try to formulate the theory in a way that is largely divorced from the data, and that's very different from, say, what scientific computing has done over the years, where they really want to do simulations and make very particular downstream claims.
Scientific computing and related areas think it's just obvious that you should be doing second order methods, and machine learning and other areas think it's just obvious that you should be doing first order methods. Whenever I see this sort of situation, where two different groups of people believe two diametrically opposed things, that tells me there is interesting space in between to move into, because both sides probably have something to offer, and both are missing something. So rather than getting into the weeds about this or that level of detail about which methods are better, keep in the back of your mind what's going on here: the phenomenon of big data, large data, and so on. If you're developing methods and foundations, what are the problems now that are different from what machine learners wanted to do, what scientific computing people wanted, and what drove applied math and the theory of algorithms in the past? That's what will be going on in the background here.

I don't have a pointer, which is probably OK because I have two things to point to. Machine learning is sort of an inverse problem, and inverse problems have a long history in applied math: you have some forward process and you want to back out the inverse process. I'll be talking about two things. First, some first order methods. First order means you roughly roll downhill; you step in some gradient descent direction, and I'll go into more detail about that. These oftentimes don't do particularly well, but they're very easy to implement and think about, and they're very easy to tack lots of knobs onto. Lots of knobs means you can easily fit lots of problems, and if you have lots of data, maybe you'll find a dataset where you do better. So the question is, in model space, is that a good way to be working? Clearly it may be, if you want to get people up and running quickly. I'll talk about some of those, and about a particular thing that combines knobs in an interesting way. Then you can say: I want to work with a narrower class of methods that is in some sense more expensive; the class of algorithms is more narrow, but it is richer and better within that narrower class. So I'll talk about second order methods. There is a naive way to implement them, and the naive way is not good, and then there is a range of non-naive things to do. If what you know about second order methods is that you have to invert a Hessian, that's not the case. That's analogous to saying that if you want to solve linear equations you need to compute an inverse. Page one of a linear algebra book says: to solve Ax = b, the solution is x = A^{-1} b. Pages two through five hundred say why you don't do that: there is a huge number of other ways to solve that problem without ever computing the inverse. Similarly here.
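To make that page-one versus pages-two-through-five-hundred point concrete, here is a minimal NumPy sketch (the matrix size is arbitrary): call a solver, which factors the matrix under the hood, rather than forming the inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 500))
b = rng.normal(size=500)

x_page1 = np.linalg.inv(A) @ b   # "page one": form A^{-1} explicitly, then multiply
x_rest = np.linalg.solve(A, b)   # "pages two through five hundred": factor and solve

# Both compute a solution of Ax = b, but solve() is cheaper and numerically
# more accurate, increasingly so as A becomes ill-conditioned.
print(np.linalg.norm(A @ x_page1 - b), np.linalg.norm(A @ x_rest - b))
```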
Right, so big data. The elephant in the room is big data, massive data. We started MMDS, the workshop on Modern Massive Data Sets, I think in two thousand six, and a couple of years later it became big data and then data science and so on. So there is a marketing and terminology question, but the elephant in the room is that every unit, every department in this university thinks data science has something to do with them: sociologists and economists and historians, not to mention scientists and engineers and so on. This is a very big societal trend; people are having to wrestle with it, and it instantiates itself in different ways. And so the TRIPODS thing that was alluded to asks: how can we pull foundations and theory out of that? Genetics and brains and a range of other areas generate lots of data; now we're going to try to take it to the next level, so you need humongous data. Again, feedback on the marketing, but: humongous data.

So how do we view this? In my experience, if you ask people what's going on, what is data science, what is machine learning, people get very dogmatic, and there is often a territorial issue; but it's a classic example of groping at the elephant and seeing what you're trained to see. The scientific computing person says: well, I need a bigger machine, or a distributed data system is what's needed. The statistician says: gee, I have to posit a model. The machine learner says: I'll tack on regularization, and I'll want to iterate; or all three, or all four of those. The algorithms person says: wow, it's big, therefore I need fast algorithms. And recently the data scientists just say: gee, it's a mess, I need to clean this mess up and everything else is easy. To get one step beyond this: what I'll be talking about today is first and second order optimization methods, and I'll say why; but keep in the back of your mind that all of these groups are coming at this question of how to optimize and how to compute things, and the question is why, and whether we can get better methods and better foundations and so on.

A lot of people think about these things in different ways, and I think the leading axis of variance, the leading order bit in how people think about it, can be characterized as follows. This slide is about five or six years old, and things are evolving, so maybe it's slightly less true now than it was then, but still. You can think about these things from what, for lack of a better term, I'll call the computer science perspective, which is the perspective you probably adopted if you were trained in computer science, certainly up until ten years or so ago. There, the data are just a record of everything there is: all the clicks at Walmart,
all the sales from some store, and the goal is to find interesting patterns, the beer-and-diapers story, if you're familiar with that. Finding these patterns is needle-in-a-haystack sort of stuff; it's typically intractable, so you posit models of data access that are appropriate for the data and you design fast approximation algorithms. The form of those algorithms is: on this data sitting in front of me, I either compute the exact answer, or I'm provably good and provably faster than the exact computation would take. There is no world out there; this is the data, this is all there is. Maybe you're going to use this for that, but this is what there is.

There is a very different view of the data, one that I think essentially everyone else in the world adopts. It's formalized in statistics departments and so on, but others hold it too: the data are just a particular random instantiation of a noisy process, and the reason they're so useful is that I want to learn something about the world. The data are just a middleman, but they're the only insight I have into the world; so the goal is not anything on the data, it's to draw insight about the world. To do that I chew on the data: I posit a model, or I do whatever, I optimize something, and I want to make a claim about the world. So these are two very, very different things.

In particular, at the foundations level: in statistics, the natural sciences, and scientific computing, the problems involve computation, but computation per se is, I think, secondary. On that view it only makes sense to think about a subset of problems, problems that are well posed. Well posed means a solution exists, it's unique, and it's robust to perturbations of the input. You might say: why would I compute anything else? And the answer is that algorithmic computer science computes on lots and lots of problems that are not well posed in that sense. So on that view it only makes sense to develop the problem first, write down a model, and think about computation later; and you may have parameters, like condition numbers, that characterize the well-posedness of the problem and enter into the analysis of the algorithm. Computer science is very different. Numerical analysis was around in the early days of computer science, but it was largely pushed out into the range of other computational-X departments, computational physics or chemistry or scientific computing; and in CS it's easier to study computation per se, if that's what you're after, in a discrete setting. You can hide solutions to halting problems in real numbers and funny stuff like that, so it makes sense to consider discrete problems.
There you're talking about Turing machines and complexity classes and so on; but in particular, efficient algorithms are statements about functional transformations, not about data, and you don't restrict to only well-posed problems. You're making statements about these transformations, so you really divorce the discussion of efficiency from data per se, and that manifests itself in how questions get posed: first find a fast algorithm, and only later see if it makes sense. So that's the general level.

Before I get into these two problems, you may say: if I care about doing a better job predicting puppy dogs on the Internet, or finding quasars, or curing cancer, why is any of this relevant? As you know, in machine learning a lot of problems are formulated as optimization problems, and this is convenient because you write down, implicitly or explicitly, an inferential task, a claim about the world, and for a subset of those you can essentially reduce the task to an optimization problem. An example: you write down the usual thing from VC theory and reduce it to an optimization problem, and all the details about the problem enter via one parameter, say. That reduction is convenient, but if you look under the hood, and you have more than just a black-box optimization problem, oftentimes you can make better downstream claims. So we're going to talk about the optimization question today, and I'll describe some theory in that context. Keep in the back of your mind (I have a few slides at the end about why this might matter more generally) that I think this is a nice way into these foundational questions: theory-of-algorithms people, applied mathematicians, scientific computing people, a range of people across the methodological spectrum are asking the same questions, but in light of the very different forcing functions of modern large-scale data.

So we're going to have two types of problems. One is what we'll call composite optimization: minimize a function that can be written as F plus H, where F is nice and H is fairly nice but might not be smooth. The canonical example is L1 regularization: it might be smooth here, but it's not smooth everywhere. Or H could be the indicator of a constraint set, which is not smooth right at the boundary. The second is a finite-sum problem: a sum of functions, each of which is sort of nice. In the universe of optimization this is a small subset, but it is a subset of real interest: in the first, F can be a model and H a capacity-controlling regularizer, a ridge or L1 penalty or whatever; the second is the canonical setup of empirical risk minimization, where I have a loss function and my overall objective is a sum over all the data points of that loss. So that's the second problem.

A lot of classical optimization, if you go to optimization or to scientific computing, is, let's say, effective but inefficient. So there are going to be two questions here, for what will be iterative algorithms, not one-shot solvers; the typical method is iterative. The first question is going to be: what's the cost per iteration?
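To fix notation, here is a minimal way to write down the two problem classes just described; the symbols are generic rather than taken from the slides.

```latex
% Problem 1: composite optimization, smooth part F plus non-smooth part H
\min_{x \in \mathbb{R}^d} \; F(x) + H(x),
\qquad \text{e.g. } H(x) = \lambda \|x\|_1,
\text{ or } H = \text{the indicator of a constraint set.}

% Problem 2: finite-sum minimization (empirical risk minimization)
\min_{x \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} f_i(x),
\qquad f_i = \text{the loss on the } i\text{-th data point.}
```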
And the second question will be: how many iterations do I need? We're going to do two things here, and we're going to deal with both questions.

In some sense, and I mentioned scientific computing, if you go one step beyond vanilla stochastic gradient descent, beyond methods where you want to be completely agnostic about where the data came from or you have very weak knowledge about the domain, and you want to do better in a fairly narrow sense, you need to go beyond very low precision computation. Randomized linear algebra gets low precision approximations unless you couple it with an iterative method, the conjugate gradient method for example, so you need to go beyond that; and that's been studied for years in a different context, so there is a rich area there. We were talking about this ten years ago, and about five years ago a killer application arrived that made it very explicit: deep neural networks and deep learning. There you had machine learning algorithms that were numerically intensive, not data intensive. Before that, the general consensus was that the hard part is getting the data in, and then you run something and the thing will work. That's not quite true, but it was the general consensus, and here was a nice killer application showing it to be nakedly false. If you want to compete in that area on the applied side, you need to do better at predicting puppy dogs or whatever the particular CIFAR-10 task is; but if you want to use it as a forcing function, you can say: they are at a very different point in the foundational space, and can you pull out what's interesting going on there? That's what we're talking about.

So I want to run an optimization algorithm. Some of the things we'll talk about are convex, some are not; I'll have more to say about that later. If you're going to run an iterative optimization algorithm, you typically want to minimize something. And I should say, if you come at this from discrete optimization, the way those problems are sometimes solved, there are certain ones with combinatorial, flow-like algorithms that exploit that structure. I'm thinking of this more generally, although in recent years some of those are implicitly running these sorts of algorithms under the hood too. The simplest thing you can think of, if I want to minimize something, is to roll downhill, and the way you instantiate that is: I have a function, I approximate it with a local linear approximation, I get a gradient, and I roll downhill. That's gradient descent. It's very easy to do, and each step is very quick, but you may need a lot, a lot, a lot of steps. And since you may need so many steps, you ask: can I improve it? There is a dizzying array of ways to improve it: you can average the last two steps, you can take this, you can take that, you can combine things. If I do stochastic gradient descent, I don't roll downhill, I get a stochastic estimate of the downhill direction, and now there's my batch size, my learning rate; there are just dozens of knobs here, and so you get all sorts of variants. That's the state of the art of first order methods in machine learning.
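As a concrete anchor for "roll downhill," here is a minimal sketch of gradient descent and its stochastic variant on a least-squares objective. The function names, the fixed step size, and the toy data are illustrative choices, not anything from the talk:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=1000):
    """Plain gradient descent: repeatedly step in the negative gradient direction."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def sgd(grad_i, n, x0, lr=0.1, steps=1000, rng=np.random.default_rng(0)):
    """Stochastic gradient descent: use the gradient of one random term per step.
    Two knobs (lr, steps) already appear, and real variants add many more:
    batch size, schedules, momentum, averaging, and so on."""
    x = x0
    for _ in range(steps):
        i = rng.integers(n)
        x = x - lr * grad_i(x, i)
    return x

# Least-squares example: F(x) = (1/2n) * ||A x - b||^2
A = np.random.default_rng(1).normal(size=(100, 5))
b = A @ np.ones(5)
full_grad = lambda x: A.T @ (A @ x - b) / len(b)
one_grad = lambda x, i: A[i] * (A[i] @ x - b[i])

x_gd = gradient_descent(full_grad, np.zeros(5))
x_sgd = sgd(one_grad, len(b), np.zeros(5))
```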
So, what's going on with first order methods? The model you should have in mind is the picture on the bottom. You can think of this thing as a circle stretched out in some funny norm; that's fine, but you don't know what that norm is, so think of it as an ellipse, not axis-aligned, in some high-dimensional place. If you're at some point, the downhill direction is here, but the minimum is over there. When I say downhill I mean the steepest descent direction, the gradient; and the point is that you don't necessarily want to move in the gradient direction if you can make a much larger step in some other direction. Accelerated methods, if you know what those are, take two steps of this kind and try to get some terms to cancel, based on ideas involving a dual space and so on. That's the picture to keep in the back of your mind. So that's why people do this, and it avoids certain kinds of trouble.

Now, the challenges with these simple methods. You just roll downhill; I showed you the code snippet, and anyone who knows Python can put that in a few lines of code. It will always give you something, even if that something is nonsensical; it will work in the sense that, syntactically, it will work. Second order methods take a bit more understanding of what's going on, so there is a higher barrier to entry, but they are much richer in some senses. First order methods are very sensitive to conditioning: if the aspect ratio of that ellipse is not ideal, you've got very serious problems. Now, stated that way, some people will say: it's machine learning, all you need is low precision, why worry about conditioning issues, just stick in some regularization and it will all work. Sometimes there are funny things going on under the hood; and if you don't like that answer, look at the killer application. There is this phenomenon called exploding gradients. Exploding gradients is basically the fact that the difference between 1.01 to the fiftieth power and 0.99 to the fiftieth power is a lot: one compounds upward, the other decays toward zero, and 1 to the fiftieth power is exactly one. If you're a little bit off and you stack a bunch of things together, this blows up in your face; so they call it exploding gradients, and then you try to patch it up. If you come at it from the perspective of what's going on in the optimization, it's obviously a condition number somewhere; it's just not parameterized the way it is classically. So you can have this sort of problem, and there are some overfitting issues that I'll get back to.

The other thing is that there are lots of hyperparameters, and you have no idea at all what the hyperparameters should be, not even the order of magnitude. If you have five hyperparameters and you don't know each to within five orders of magnitude, that's an obscene overfitting problem; and so gobs and gobs of people in Silicon Valley are paid gobs of dollars to fiddle with parameters. Maybe that is the best solution, but the claim here is that if you understand a bit more about the structure of the problem, you can get rid of a whole bunch of those hyperparameters and do a better job out of the box. I'll mention that towards the end. All right, so remedies.
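A three-line numeric check of that compounding point; the depths are illustrative:

```python
# A factor slightly above or below 1, stacked many times, diverges or
# vanishes: the arithmetic behind exploding/vanishing gradients.
for depth in (50, 500):
    print(depth, 1.01 ** depth, 1.00 ** depth, 0.99 ** depth)
# depth 50:  1.64...    1.0   0.605...
# depth 500: 144.77...  1.0   0.00657...
```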
So: simple first order methods; accelerated; adaptive (I'll get back to what I mean by that); or second order methods. There are two ways to think about second order methods. One is: I have a point, I have a line, and I put some quadratic model there. That's a perfectly legitimate way, and it's the way I'd want you to think about second order. But sometimes subtleties arise in second order methods that aren't immediately obvious from that perspective, and they arise basically because the other way to think about second order methods is as a way to find a zero of a function; at bottom, that's all they do, they find the zero of a function. If you go to a numerical analysis book, Newton's method is: find me the zero of a function. Applied to optimization, the zero you're finding is of the derivative: you look at where the gradient is zero, and you apply the method there. But there are a few little gotchas you've got to be careful of with second order methods, and it's because they don't roll downhill, they find zeros; so you have to think about how to certify that you're at a minimum rather than at a maximum or a saddle point. Whether that's a bug or a feature depends on the application, but it's something to keep in mind.

So I'm going to talk about two things. One is a not-so-simple first order method that gets the best of both worlds in terms of combining knobs: FLAG and FLARE, for the standard composite optimization problem. The second order methods will be stochastic Newton-type: don't invert your Hessian, compute an approximate Hessian (I'm not going to do Hessian-free methods or L-BFGS, those sorts of things), and apply it Newton-style, not just in the convex case but also non-convex, with trust region and cubic regularization, some of those things.

There are a bunch of projects here and a bunch of people involved in the various pieces. In bold is Fred Roosta; he has actually moved to Queensland now. He was a postdoc with me, did a great job, and really led the charge on all of these things; he's involved in every one of them. Other people are in and out, but he's really been the lead.

OK, so subgradient methods. This is the problem we're solving, F plus H. There are a bunch of equations here; these slides are online, and I don't want to go through all the details. If you're familiar with this stuff the structure will look familiar, and if you're not, you can look afterwards; but let me point out the relevant things. For subgradient methods (a subgradient is like a gradient, except it's a "sub"; think of it as the same thing), you compute g, a subgradient, in line three; and in line four, x_{k+1}, the next iterate, is the argmin of g dotted into x plus a proximity term, with an alpha parameter. So that's the optimization problem solved directly at each step. You need to get a subgradient of F, and stochastic methods do that easily; and you need to choose a step size, which is a knob to fiddle with; we'll get back to that. The two limiting cases are: I want the step size to be constant, or I want it to go to zero in some way.
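Here is a minimal sketch of that kind of iteration for the composite problem min F(x) + H(x) with H the L1 norm, where the per-step subproblem has a closed-form soft-thresholding solution; the decaying step-size schedule is one illustrative choice among many:

```python
import numpy as np

def soft_threshold(v, t):
    """Closed-form prox of t*||.||_1: argmin_x (1/2)||x - v||^2 + t*||x||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_subgradient(grad_F, x0, lam, alpha0=1.0, steps=500):
    """Each step k: g = (sub)gradient of F at x_k, then
    x_{k+1} = argmin_x  g.x + (1/(2*alpha_k))||x - x_k||^2 + lam*||x||_1,
    which reduces to soft-thresholding a gradient step."""
    x = x0
    for k in range(1, steps + 1):
        alpha = alpha0 / np.sqrt(k)   # decaying step size (one of the two limiting cases)
        g = grad_F(x)
        x = soft_threshold(x - alpha * g, alpha * lam)
    return x
```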
Logistic regression is a standard example that fits into this, and a lot of other examples fit the setup too. A practical question: you can have infrequent features. We're going to hear about term-document modeling: some terms appear in lots of documents, but lots of terms appear in very, very few, and those give you small partial derivatives. If you're familiar with natural language processing (not state of the art now, but even in the past), TF-IDF normalization rescales things so that very frequent terms don't mess everything up; this is a little bit analogous to that. The idea is that very infrequent features may be highly predictive: a word that is unique or fairly uncommon might be highly predictive, while very frequent features, like the word "and", are not so predictive. The way that shows up in the optimization is that frequent features have large partial derivatives, so for those directions maybe this learning rate parameter should be adjusted down; and infrequent features have small partial derivatives, so for those it should be adjusted up. So now we have a learning rate per feature.

And I'm telling you that you need different rates on different features, which is even more knobs. You replace alpha_k, which is a number, in fact a sequence of numbers, with a scaling matrix; AdaGrad, RMSProp, Adam, there is a whole bunch of methods that follow this general approach. It's the same algorithm: compute g, form a scaling matrix S, and solve the same sort of problem, except with S wedged in between. S could be anything, but here it's a diagonal matrix, and if it's roughly the identity you recover the previous algorithm. The type of guarantee you get with AdaGrad is: with F my objective, the suboptimality is bounded by root-d times D-infinity times alpha, over root-T; as opposed to the subgradient descent method, where the bound is D-two over root-T. Let's unpack this. There is a constant out front; alpha is some number; d is the dimension, lowercase d; and D-infinity is a max over infinity norms, as opposed to a max over two norms, of the size of the set you're dealing with, the constraint set. These two bounds are incomparable; neither is always better than the other. They both have the one-over-root-T, but the dependence on dimension, on the geometry of the set, and on other things is quite different. The comparative factors, D-two and D-infinity, capture the geometry of the set you're trying to optimize over, and alpha depends on something easily computable having to do with the gradients you've seen as you run along; their variability, let's say.
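A minimal sketch of that diagonal-scaling idea, in the style of AdaGrad; the accumulation rule is the standard one, and the names and constants are illustrative:

```python
import numpy as np

def adagrad(grad, x0, alpha=0.1, steps=1000, eps=1e-8):
    """Per-coordinate learning rates: coordinates that keep seeing large
    gradients (frequent features) get their step shrunk, while rarely-updated
    coordinates (infrequent features) keep a larger effective step."""
    x = x0
    G = np.zeros_like(x0)   # running sum of squared gradients, per coordinate
    for _ in range(steps):
        g = grad(x)
        G += g * g
        x = x - alpha * g / (np.sqrt(G) + eps)   # diagonal scaling S = diag(sqrt(G))
    return x
```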
So what we've talked about there is one way to start to decouple the geometry from the other parameters, and to say: I want a method that combines the best of both worlds. There we got a handle on the geometry and the dimension, but we said nothing about the number of steps, which is a different type of knob, and a whole separate line of literature says: I don't want to improve the dimension dependence, I want to improve the T dependence. Same setup: subgradient methods give me one over root-T; there is something called ISTA, which is a bit better and gives you one over T; and then FISTA, where the F is for "fast", is faster still and gives you one over T squared. The point is, if I want a certain accuracy, one over the square root of a thousand versus one over a thousand versus one over a thousand squared are very different, so the latter methods are going to be faster. You can easily convert this, if you're more comfortable that way, from rates in T to epsilons: one over epsilon squared versus one over epsilon versus one over root epsilon.

So here we're talking about improving the T dependence. The way to think about it: one camp says, gee, first order methods have these problems, so tack on a bunch of parameters to control learning rates; the other camp says, gee, they have these problems, so do ISTA and FISTA and add machinery to accelerate things, the way second order methods would look. Accelerated methods get better rates; adaptive methods get better constants, constants that depend on the geometry of what's going on (whether those count as constants depends on whether you think the dimension is a constant). Can you get the best of both worlds? In the interest of time I won't go into too much detail, but the short answer is going to be yes: FLAG and FLARE. Day or night, a flag and a flare will save your life. FLAG is a Fast Linearly-coupled Adaptive Gradient method, and FLARE is going to be a better version of that; I'll save what the last word stands for until later.

There is a range of ways to think about these accelerated methods. The original approach of Nesterov was very algebraic, and I think most people found it not so intuitive. People have recently tried to relate acceleration to continuous-time differential equations, and one of the approaches I like best is Allen-Zhu and Orecchia's; they call it linear coupling, where you combine information from a gradient step and a mirror-descent step, a primal view and a dual view, in a certain way. We're using exactly their linear coupling technique: we take the information from these two steps and couple them in a certain way, finding the coupling parameter with a bisection search (we'll get back to that). So we adopt their approach, if you happen to know it.

So, FLAG. You can see what happens: there is a basic thing that almost everyone can implement, and when that doesn't work you add a few lines, and when that doesn't work you tack on whatever else, and the code gets a little more complicated. y_k is in step three; y_{k+1} is some prox step; you get a gradient mapping; in the next step you form an S; in step seven, z_{k+1} solves a problem of the same form as the one we saw earlier, a little different, but if you had a routine for that one you can call the same routine here; and in step eight you do the linear coupling, where prox is the usual prox evaluation, which generalizes projection, the Moreau-type idea from convex analysis. So this is FLAG, and FLAG is good by certain standards; which standards depends on whether you want to evaluate yourself as an ML person, a theoretical computer scientist, or an optimization person, but by one of those metrics FLAG is good. The bird's-eye view of FLAG: you do the usual gradient step, you form a gradient history, you scale, and you combine. It's a very particular way to combine, but it combines with the goal of getting the best of both worlds.

So the theory: FLAG gives you the root-d, D-infinity dependence, times some fudge factor beta squared, divided by T squared, just as FISTA gives D-two divided by T squared. The convergence rates there both look "second order" in T, so they're both good.
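For reference, here is a minimal sketch of FISTA, the one-over-T-squared accelerated baseline that FLAG and FLARE get compared against, on the same L1-composite problem as before. The 1/L step size for an L-smooth F is the textbook choice, and the names are illustrative:

```python
import numpy as np

def fista(grad_F, L, x0, lam, steps=500):
    """FISTA: gradient step + soft-threshold prox + Nesterov-style momentum,
    giving O(1/T^2) suboptimality for L-smooth F plus lam*||x||_1."""
    x = x_prev = x0
    t = 1.0
    for _ in range(steps):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)   # extrapolation (momentum) point
        z = y - grad_F(y) / L                         # gradient step at y
        x_prev = x
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox (soft-threshold)
        t = t_next
    return x
```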
Now, I don't yet know whether this is better or worse, because we've introduced some knobs and removed some knobs, so let's look at that. There is a comparative factor, the analog of what we had before: this one depends on the geometry, and the beta depends, in a slightly different way, on the gradient history, through the linear coupling.

If you're interested just in iteration complexity (one way machine learners parameterize the complexity of an algorithm is the iteration count: not how expensive each iteration is, just assume each iteration is cheap and count iterations), then here you need to couple the two pieces with an epsilon binary search. For Allen-Zhu and Orecchia (I don't remember exactly where the paper appeared), this was something they didn't need to worry about; we do, and the question is: find an epsilon-approximation to the root of a certain equation; how long does that take? At most log(1/epsilon) bisection steps, and at most 2 + log(1/epsilon) prox evaluations per iteration. So there is a lot going on here: what's actually expensive in the whole setup? Because if I write a paper and say I'm faster than you, and I implemented your code (maybe not doing a particularly good job of it) and I implemented my code (being very careful to do a good job), then, look, I'm better than you. That doesn't seem like the best way to do things. A way that is common, not in statistics, not in algorithms, but in scientific computing, is to evaluate the complexity of an algorithm by its most expensive step, the number of matrix-vector multiplies. Let's do the same thing here with the number of prox evaluations: nothing else matters, that's the dominant cost, everything else is second order; no pun intended, I guess.

So FLAG, with its linear coupling, takes a certain number of steps, and in particular it takes two prox evaluations per iteration more than FISTA. Everything we talked about before was interesting at a reasonably high level of theoretical granularity; now we're asking a finer-scale question. At most two prox evaluations more than FISTA: is that a lot or a little? I don't know; it depends how you implement it, you tell me. It turns out to be a lot. So now the question is: rather than reducing the ML problem to an optimization problem and then reducing that to a bunch of knobs to combine, let's look under the hood of those prox evaluations at what we're actually doing. We're linearly coupling two different things, and under the hood we can do one of two things. If we ask the worst-case question, we just say: I want to guarantee that I'm good on the next step; that costs what we just talked about. Or I could do something, not exactly iterative refinement, but something iteratively corrective. In the worst case you don't control how many inner iterations that takes (the answer is it takes two, and you can construct something worse), but you are in fact correcting as you go. So, FLARE:
if you do that iterative correction, you have exactly the same number of prox evaluations as FISTA. So the dominant cost has been brought down to match, but FLARE carries theoretical guarantees similar to FLAG's: the best of both worlds, at FISTA cost, compared to FISTA. I'm going to gloss over the empirical results, not because I don't want to talk about that stuff, but because I want to go into a bit more detail on the material at the end; that's on purpose. Suffice it to say that the claims I was making can be borne out, and we did a fairly nice job evaluating that.

This is interesting because it's a point in parameter space where we can make first order methods look a little like second order methods, and we have a good understanding of what's going on with the geometry, the constraint sets, versus these various knobs. So now: can we do it with a lot fewer knobs? The reason I want a lot fewer knobs is that, in some sense, if you have a million algorithms and a million data points, in the cross product you're bound to find a positive result. This is the thing statisticians try to control with false discovery rates. The analogous question here: if you have gobs of knobs, you're bound to find some parameter setting where it works, but it's not clear that it works in some cross-validation sense; and it's not even clear how to put it into a cross-validation setting, because this isn't a model, it's an algorithm. The algorithm is implicitly implementing a model of whatever it implements, but it doesn't fit cleanly into the usual cross-validation framework you might think of in statistics.

So let's shift gears. Now we're going to look at minimizing the finite-sum problem, and we're going to do a couple of things. We'll look at stochastic Newton-type methods, second order type methods. They apply more broadly, but think of the problem as convex, and it's going to be the usual story: if things are convex in such-and-such a way, then I can introduce stochasticity into the second order method and good things still happen. Then I'll talk about trust region and cubic regularization methods.

The formal story for second order methods is: if you're convex and you choose all your pieces properly, you iterate and you find the solution. What if you're not convex? Lots of examples in scientific computing are that way; neural networks are not obviously convex; and a range of other things, non-negative least squares, matrix completion, topic modeling, lots of things are non-convex. Non-convex is a big place, and lots of stuff can happen there; so from this perspective, what we want to ask is: are we at a minimum or not, or are we stuck at a saddle point? If you look at a plot, in your calculus book, how can you not know? You just look. But in a million dimensions it can be hard to answer that question. With trust region and cubic regularization methods, you either add a cubic term, which can find a descent direction near a saddle, or you have a region within which you trust a quadratic model and solve a related problem, which amounts to an eigenvalue-type problem. So this is the hello world of applying these stochastic second order methods, and we can do the same thing we did with FLAG and FLARE: take these methods into non-convex pipelines.
So, second order methods use gradients and Hessians, and I think Hessians are just mentally much more complex objects for people. Part of what's going on at Berkeley (there is the data science division and so on) is that I'm involved in teaching a freshman-level, maybe sophomore-level, class on mathematics of data, and it involves some linear algebra, probability, and optimization. The way this material is usually taught, you do calculus, then a little bit of multivariable, and then curls and divergences, because that's what engineers care about, and only later might you learn linear algebra. For this stuff, I want to start with the linear algebra, because the intuitions you have about R-two do not all generalize to R-million; in fact R-million behaves in very funny ways, totally different from the low-dimensional R-two and R-three, and you can understand that in terms of measure concentration and some coin-flipping ideas. So I think that's the way to teach it; but, to be clear, the way it's usually taught, Hessians are conceptually much more complicated objects: you need some idea of what a matrix is, or the way they're usually presented involves a lot of subscript chasing. The pluses of second order methods: fast convergence rates; the structure of the algorithm, the thing you're doing itself, is resilient to ill-conditioning; and they can overfit, in the sense that if I want machine precision I'll get machine precision, but they overfit in a very structured way. It's not a million knobs all over the place; they overfit in a very structured way, which, in the space of models your algorithm implicitly implements, might be a good thing. But certainly done naively the per-iteration cost is high; so can we get around that?

Let me skip ahead, because I do want to get to the slides at the end. Deterministically approximating second order information cheaply is a large body of work; there are a bunch of books on this, and a lot of great stuff: quasi-Newton methods, BFGS, limited-memory BFGS, and so on. With randomized linear algebra, I had the good fortune to bump into that area around the time of my dissertation and moved into it. People were working on random sampling algorithms, and it seemed that the particular algorithms themselves weren't going to be competitive when the rubber hit the road, but the ideas were very different, and they could become competitive. Fast forward some number of years, and the answer is yes: Blendenpik can use randomized linear algebra to beat LAPACK in wall-clock time; then the LAPACK people respond that they can do better, and there is an arms race; but you're in the arms race, you're not off by polynomial factors. So can you do the same sort of thing here? What we want to do is randomly approximate the second order information.

Do we care about machine precision? Do we care about conditioning? How do we care about them? I'm going to punt on that question, because I'm going to care about these things insofar as they matter for the foundations of data; we're doing this TRIPODS institute. Should I worry about the same things people worried about when they were developing the IEEE floating point standard? Is that stuff useless now, or are some nuggets of it still useful? That's
a long, hard question with a long answer, but we're going to worry about these things insofar as they are useful for machine learning and data analysis applications these days, broadly defined.

If you do that, then you should ask yourself: we're going to randomly approximate something, so where does the randomness live? If you ask people where the randomness is, a statistician will say the randomness is in the data: they have a model, and the model is y equals X beta plus noise, or whatever, so it's not random in the algorithm; the randomness is in the model, and the algorithm is called as a black box. If you ask a computer scientist where the randomness is: the data is all there is, there is nothing else, so if there is going to be randomness it has to be in the algorithm run on the data. What we saw in randomized linear algebra, although it was harder to see there because of how the structure plays out, and what you'll see here also, is that the randomness in the algorithm can interact nicely with the randomness in the data, and can implement regularization implicitly. I'm not going to talk about that today, because it's a finer-scale thing; but we're going to be randomly approximating second order information, and we may not even want to solve the problem we say we want to solve, because we want to solve the implicitly regularized problem, and then there is the question of how that maps onto these conditions and guarantees.

So: subsampled Hessians. There have been a bunch of results in recent years on this. Martens did it a while back, actually, in the neural nets application; he was well ahead of the curve, and it's one of those things that was just good stuff early on, with actually nice results; if you're not familiar with it, look up Hessian-free optimization. Sketching the Hessian, subsampling Hessians and gradients: there is a range of things here, and I'll be talking about some of them. The idea is that we want to find that minimum point, and we're going to run an iterative algorithm. If you get the exact gradient and the exact Hessian and your function is quadratic, it takes one step; if you're not quadratic, you iterate at whatever rate. What if I get a noisy estimate instead? What if it's extremely good, very little noise? What if it's extremely bad, but marginally better than random? That's the tradeoff: how big is your mini-batch, how do you control variance? And I want to show that we get guarantees similar to the deterministic case, but with randomness.

So we can subsample Hessians and gradients. Consider the convex case, the usual setup. We want to design methods for machine learning that are not as ideal as Newton's method but have these properties: they turn in the right directions (not here, but there); they take steps of the right length, where a step size of one works most of the time; and they scale up in some meaningful way. So we're going to sample gradients and Hessians, where the sample size has to be independent of n,
the number of data points, and of d, the number of features. Turning in the right directions means: don't just go downhill, move towards the minimum, in the Hessian sense; that's the natural structure if you have second order information. So we want the Hessian approximation to preserve the spectrum of the true Hessian as much as possible, and that boils down to subspace embedding ideas from randomized linear algebra. With that we're not ideal, but close: we get a fast local convergence rate, close to that of the much more expensive Newton method. And the right step length: you don't have decaying step lengths and all these other hyperparameters; step size one works, or maybe it works after an initial burn-in. And we want to scale up.

For turning in the right directions, the theory is going to be: if the sampling size |S| is larger than some quantity, then with probability 1 minus delta you have a Hessian approximation to within a factor of 1 plus-or-minus epsilon. If you're familiar with this kind of statement, the equation is the obvious thing; if you're not, H is the approximation, del-squared F is the true Hessian, and they are within an epsilon of each other in the spectral sense, with delta the failure probability. So we get a good approximation of the Hessian.

Fast local convergence rate: if you set the subsample size right, then with high probability the distance from one iterate to the minimum has a linear term and a quadratic term, and this is a very nice property; you can trade one against the other, and moreover the coefficient on the linear term is problem-independent, so you can make it arbitrarily small. You get what's called Q-linear convergence, and you can also get superlinear rates in certain settings. So it converges the way it should.

Right step length: unit step size eventually works, with uniform subsampling; I'm going to gloss over this, but the right step lengths eventually work, and we can do it with an inexact update as well. Local and global: if you're familiar with the usual convergence properties, then if we choose the sampling of the gradients and/or Hessians the right way, it converges. If you're an expert in optimization, there is local versus global convergence, and there are subtleties I'm glossing over, but that's the leading order bit. And it all plays out very nicely: an optimization algorithm for which the unit step length works has some wisdom in it. It's not handing you gobs of knobs; it's getting there in a very clean way. In a sense it's the right way to overfit, and if we're going to do early stopping, we can characterize the implicit regularization of early stopping in sort of the right quadratic well.
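A minimal sketch of one variant of this: a subsampled Newton iteration for a finite-sum problem with exact gradients, subsampled Hessians, and unit steps. The dense solve, the tiny ridge term, and the sample size are illustrative choices (an inner conjugate-gradient solve would avoid forming H explicitly):

```python
import numpy as np

def subsampled_newton(grad, hess_i, n, x0, sample_size=50, steps=50,
                      rng=np.random.default_rng(0)):
    """Each iteration: H = average of |S| randomly chosen per-example Hessians,
    solve H p = grad(x) for the Newton direction, and take a unit step."""
    x = x0
    d = x0.shape[0]
    for _ in range(steps):
        S = rng.integers(n, size=sample_size)            # uniform subsample of examples
        H = sum(hess_i(x, i) for i in S) / sample_size   # subsampled Hessian estimate
        H = H + 1e-8 * np.eye(d)                         # tiny ridge for numerical safety
        p = np.linalg.solve(H, grad(x))                  # solve, never invert
        x = x - p                                        # unit step length
    return x
```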
So that was all convex; now let's deal with non-convexity in the last five minutes or so. Saddle points, local minima, local maxima: things get very wild. Non-convex problems: what does that even mean? It's a big space; saying something is not convex just says "here is convex" and "everything else is not", which is not a strong statement, so what you've got to do is carve out bits and pieces that are quasi-convex or invex or have some other structural property. That's what we'll be looking at. And we want some notion of, let's call it, (epsilon-g, epsilon-H)-optimality: the gradient is small, its norm less than epsilon-g, and lambda-min of the Hessian is positive, or at worst only slightly negative, greater than minus epsilon-H.

To deal with this there are trust region methods. Trust region methods say, basically: don't choose a direction and then do a line search; instead, choose how far you're going to go, and then choose where exactly to go, and it's not along a line, it's within a region. So they're actually very different from line search methods, if you know line search methods. And cubic regularization methods basically look at the next order term in the Taylor expansion, and if you're near a saddle point, that term helps you find a descent direction.

To get iteration complexity (and there is a lot of technical work behind iteration complexity), previous work required condition number one, where the approximate Hessian agrees with the true one along the step direction up to a term quadratic in the step length; this is stronger than the Dennis-More condition, if you're familiar with that from optimization. You can relax condition one to condition two, where the Hessians are just good, without that square. Who cares, right? This seems like a weeds-level thing. But condition one is a much more convenient thing for analysis (the analysis is much easier, and a lot of people work with something like it), while condition two is actually a lot harder: we have fifty pages of math to justify it. The payoff is that quasi-Newton methods satisfy two and not one, and in particular sketching and subsampling methods from randomized linear algebra satisfy two and not one. So we were able to operationalize this in a much stronger way, because we satisfy condition two and not one. With that setup, let me gloss over the details and just say that both for trust region and for cubic regularization we get results that are either the same as, or slightly worse than, what you get with the classical deterministic methods, in terms of finding the right directions, getting out of saddle points, and finding local minima.

All right, so the math sounds nice. Here are some preliminary results; they're at the end of the slides because we actually have a bunch of other results that motivated some of these questions. So now we can say: given classical trust region, classical cubic regularization, classical convex theory, the leading order bit is that we can reproduce a bunch of it with stochastic first and second order methods. For first order, others have done gobs of work on that; we can take the best of both worlds; and now we can do it for second order methods. And the running times are very practical; this is not just a theoretical claim; these are practical in the sense of factors of two in the number of prox evaluations or matrix-vector products.

Who cares? Why would one want to do second order optimization in machine learning applications? In some cases you have ill-conditioning, and when you have ill-conditioning, if you do a first order method, you're the purple curve: you can run two hundred fifty iterations, and I could make it two hundred fifty thousand and you'd still be up there. The better first order variants, by the time you get to fifty, are the black or red curves, depending on details like the sample sizes; and you might be green, where green is Newton; green gets there, but it's more expensive. So beating the black curve is better. If you're solving a problem that is ill-conditioned (this one is not a neural network with exploding gradients, but that's an example of one), you do better; and a scientific computing person will look at that and say, yeah, obviously. But OK.
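Stepping back for a moment, here is a minimal sketch of the cubic-regularization idea described above: minimize the model m(p) = g.p + (1/2) p.Hp + (sigma/3)||p||^3, here by plain gradient descent on m, which is a simple but not especially efficient way to attack the subproblem. All names and constants are illustrative:

```python
import numpy as np

def cubic_reg_step(g, H, sigma=1.0, inner_steps=400, lr=0.02,
                   rng=np.random.default_rng(0)):
    """Approximately minimize m(p) = g.p + 0.5*p.H.p + (sigma/3)*||p||^3.
    Unlike the pure quadratic (Newton) model, m is bounded below even when H
    has negative eigenvalues, so near a saddle point its minimizer moves along
    the negative-curvature direction, giving a descent direction."""
    p = 1e-3 * rng.normal(size=g.shape)   # small random init, so negative curvature
                                          # is picked up even when g is exactly zero
    for _ in range(inner_steps):
        grad_m = g + H @ p + sigma * np.linalg.norm(p) * p
        p = p - lr * grad_m
    return p

# Toy saddle: F(x) = x0^2 - x1^2 at the origin has g = 0 and H = diag(2, -2).
# A pure Newton step stalls there; the cubic model steps out along x1.
print(cubic_reg_step(np.zeros(2), np.diag([2.0, -2.0])))
```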
"My machine learning problem isn't ill-conditioned; why should I care?" Fair question. A few answers. First, good generalization and robust parameter tuning: the way this works, with people fiddling with knobs, is that very practical neural network algorithms exhibit convergence properties that are just qualitatively inconsistent with what convex theory would suggest. What we see is good generalization and a robust kind of parameter tuning: the thing on the top, and a bunch of the other curves, are first order methods that perform very poorly or fall over, while the second order methods, almost out of the box, perform well; you'll be down in the black and the red and the blue. For the other ones, you had to beat on them to find the right hyperparameter values, even in this very simple setup. Second, the ability to understand and escape saddle points: a phenomenon you see a lot is that you get a little bit better, then you flatten out and run forever, and then, boom, you get better. Why? I change one little parameter and sometimes it gets better, and sometimes it doesn't. A lot of that has to do with understanding the structure of basins and saddle points; and here, in this particular plot, the yellows and oranges are momentum-based methods: you go down, you flatten out, you run forever. In this particular setup our initial burn-in was a little slower, but then you hit the basin, you're able to work out the corner cases better, and you avoid that plateau. Third, if you're interested in distributed applications, you can implement these things with low communication; the communication is the bottleneck, not the computation, and you can run them in distributed settings. We've done randomized linear algebra on terabytes of data; not this yet, but this will carry through to that, so you can implement it in distributed settings. And if you want to put it on GPUs and argue about factors of two in running time, you can put it on GPUs; that part is unpublished, but we have it.

So there are a bunch of preliminary results at the end. I think I was here maybe five years ago; whether it was then or before that, I can tell you how those preliminary results panned out. Right now there are a lot of preliminary results again, but I think we're setting up the framework to do a bunch of first and second order optimization methods, analogous to what we did with randomized linear algebra, at the heart of a foundations-of-data discipline, aimed at very practical downstream problems. So with that, let me wrap up; and I'm the leading act, so stay around for the next hour. Thanks.