[00:00:05] >> Thanks everyone for attending today's machine learning seminar. Today we're fortunate to have Professor Robert Nowak from Wisconsin-Madison give the seminar. I'll start with a very brief introduction and then we'll begin the talk. During the talk, feel free to send your questions to the question box — there's a Q&A panel, so add a question there if there's anything you want to ask. [00:00:34] All right, so Professor Robert Nowak is the McFarland-Bascom Professor of Engineering at the University of Wisconsin-Madison, where his research focuses on signal processing, machine learning, optimization, and statistics. He is a professor in Electrical and Computer Engineering, and he is also affiliated with the departments of computer science, statistics, and biomedical engineering. He is a Fellow of the IEEE and a Fellow of the Wisconsin Institute for Discovery, and he is also an adjunct professor at the Toyota Technological Institute at Chicago. Ok, so without much further ado, welcome Professor Nowak — we look forward to a wonderful talk.
[00:01:26] >> Thank you for that great introduction, and thank you to all the organizers. I'm going to talk about active learning, and I want to give you a little bit of a tour, all the way from actively learning linear classifiers to actively learning overparameterized neural network classifiers. [00:01:45] I'll explain everything as we go, and I just want to mention that this is joint work with Mina Karzand, who was a postdoc with me — she's now at the Toyota Technological Institute at Chicago and will be joining UC Davis about 12 months from now — and Rahul Parhi, who is a graduate student at the University of Wisconsin. I assume my slides transitioned properly; if not, somebody speak up and let me know. So on this slide I first of all want to introduce and motivate the idea of active learning, and to do that I want to talk about the conventional, what we call passive, machine learning pipeline. The idea here is that we have a large set of unlabeled raw data, we ask a human expert or a group of experts to label that data for us, and then we turn that labeled training set over to a machine learning algorithm, which produces a predictive model. I think everybody is familiar with this pipeline, but it would look something like this in the context of image classification: we have a bunch of images that we just collected from the Internet, we have a human expert label them — dog, bird, truck, and so on — and then we train our machine learning algorithm, maybe a deep neural network parameterized by some weights w, which yields a function f_w that we can use to predict the labels of new unlabeled images.
[00:03:15] In fact, this has been a great success — we all know about it. Computers can recognize images as well as humans, they can translate from one language to another, and they can master games better than humans. But under the hood of all these engineering accomplishments there is a lot of human effort in many cases. For example, millions of labeled images from humans in the case of image classification. In text translation they use corpora of text, for example from the United Nations, that have been painstakingly translated by humans, and then the machines are trained on those translation pairs. In game playing, the machines learn by playing an enormous number of games, far more than a human like my son would ever play. So there's a big difference between how machines are learning these days and how humans like my son learn, and there's also a huge amount of human effort in the design of these machine learning systems. So we'd like to think about how we can train machines with less labeled data, with less human supervision. This is a super active area of research, and there are a number of different approaches, ranging from things like self-supervised training to things like active learning. I'm going to focus on the active learning approach, which I'll illustrate with the same kind of scenario again. So this was the original pipeline. The idea of active learning is essentially to add an additional component to the machine learning system, called a data selection algorithm. What this component does is it looks at the current model learned by the machine learner and uses that model to decide which examples it would like the human expert to label. In other words, the machine learning algorithm is asking for help: maybe it's very confident in its predictions on certain examples but has less confidence on others, and so on those less confident [00:05:14] examples it might ask a human expert for a label. The idea is that the machine automatically and adaptively selects the most informative examples for labeling, and this reduces the time and effort of the human supervisors in the training process. Ok, so I hope that's clear to everybody — I'll illustrate it in a minute with a concrete example — but I just want to point out that [00:05:41] I've been working on this problem of active learning for maybe almost 15 years now, and when I started it was something that lived mainly in the academic world and was largely theoretical. More recently this has really come up in newer services that you can purchase. For example, Amazon has a service called SageMaker Ground Truth, which does exactly what I was saying: data is only routed to humans in the active learning system if the system can't accurately label it itself. Prodigy is another company and service that uses active learning to annotate large data sets. So this is a real issue, a real challenge in modern machine learning systems — how do you get the labels for your training data set — and maybe you should do something systematic and efficient about it.
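To make the pipeline concrete, here is a minimal sketch of a pool-based active learning loop with the data selection component just described. The class names, the oracle interface, and the retraining schedule are illustrative assumptions for this sketch, not a description of any particular service mentioned in the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_pool, oracle, n_queries, select_query):
    """Generic pool-based loop: fit a model, select an example, ask the human, repeat.

    X_pool       : array of unlabeled feature vectors (the pool)
    oracle       : callable index -> label; stands in for the human expert
    select_query : callable (model, X_pool, labeled_idx) -> index to label next
    """
    rng = np.random.default_rng(0)
    labeled_idx, labels = [], []
    # Seed with random labels until both classes are present, so the first fit works.
    while len(set(labels)) < 2:
        i = int(rng.integers(len(X_pool)))
        if i not in labeled_idx:
            labeled_idx.append(i)
            labels.append(oracle(i))
    model = LogisticRegression()
    for _ in range(n_queries):
        model.fit(X_pool[labeled_idx], labels)           # retrain on current labels
        j = select_query(model, X_pool, labeled_idx)     # the data selection step
        labeled_idx.append(j)
        labels.append(oracle(j))                         # human expert labels it
    return model, labeled_idx
```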
[00:06:38] So here's one example that we've looked at a little bit at Madison, and this picture emphasizes one feature of active learning, which is that it's sort of a closed-loop approach to machine learning. In this particular application, the idea is we want to train a machine to automatically look at electronic health records and decide whether or not a person is likely to get a disease, or maybe even has a disease, based on all the information in their EHR. So human experts need to provide labels to the machine learner, and labeling just a single EHR could take several minutes, so it's a very time-consuming process. Also, the experts get paid on the order of one to two hundred dollars an hour, so these are very expensive labels to collect. But that's the idea: human experts label a subset of these electronic health records, those are turned over to the machine learning algorithm, the machine learns a rule, that rule can then be applied to new unlabeled electronic health records, and then the active learner comes into play to select, from that set of unlabeled records, some that it needs help with and turn those over to the human experts for labeling. Let me illustrate this a little bit graphically. Suppose that we had two electronic health record features, feature 1 and feature 2 — I'm focusing on two so that we can visualize it very easily. In this picture in the upper left, what I'm showing are examples: each dot is a different electronic health record, located according to its two features. The grey dots are unlabeled electronic health records, and the red and blue dots are ones that have been labeled by the human experts — for example, maybe red indicates disease and blue indicates healthy. [00:08:26] Given those few labeled examples there might be a best linear classifier, like I'm showing with this dashed green line, and the way learning and training proceed in passive learning is that the system will just retrieve another unlabeled example at random from the pool of electronic health records — maybe this dot that I've circled in purple here — and it will repeatedly do that: pick another random example, ask the human experts to label it, and we'll get some sort of performance out of our system. As the machine gets more labeled examples, the test error rate will go down, and that's what I'm showing down in the lower right. This is the actual test error decay as a function of the number of labeled examples, in a setting where we're trying to predict whether somebody has cataracts or not based on about 6000 electronic health record features, using linear classifiers just as I'm depicting up in the upper left. So what does active learning do? Active learning says, well, maybe the system should try to decide which examples it's least confident about, and one heuristic for that is to look at examples that are close to the current decision boundary. So I've circled one of those examples in orange here, and this is how active learning would work: it would select a sample like this, ask for a label, update the linear classifier, and repeat. What that can do is drive the error down much more quickly.
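A minimal sketch of the selection heuristic just described — query the unlabeled record closest to the current linear decision boundary. This is generic margin-based uncertainty sampling, not necessarily the exact rule used in the cataracts study; it plugs into the `select_query` slot of the loop sketched earlier.

```python
import numpy as np

def margin_select(model, X_pool, labeled_idx):
    """Pick the unlabeled point nearest the current linear decision boundary.

    Works with any fitted linear model exposing coef_ and intercept_
    (e.g. sklearn's LogisticRegression or LinearSVC).
    """
    w = model.coef_.ravel()
    b = float(model.intercept_)
    # Distance of each pool point to the hyperplane w.x + b = 0.
    dist = np.abs(X_pool @ w + b) / (np.linalg.norm(w) + 1e-12)
    dist[labeled_idx] = np.inf     # never re-query an already labeled point
    return int(np.argmin(dist))
```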
In this particular example we saw about a 3x improvement, so we can learn a very good classifier using three times fewer examples with an active learning method as opposed to passive learning, and that can be a pretty big gain in terms of the time and effort required from those human experts to train a good classifier for this task. [00:10:19] I just want to mention some of the challenges or hurdles to active learning — it sounds great, so why don't we just do it all the time? Well, there are really three reasons. First of all, active learning combines data fitting and data collection, and this requires an interactive computing architecture, so it's not as simple as just coding up your favorite algorithm in Python and applying it to a fixed dataset; the dataset is being curated as you go, and so you really need a kind of real-time interactive computing platform. We and other groups have developed systems like this — we have one at Wisconsin called NEXT, which is an open-source software system that helps people prototype active learning algorithms. The second challenge is that, as I mentioned, this is a closed-loop feedback system, and that means that algorithm design and analysis are a lot more challenging than they are in the traditional passive learning setting. And then finally — and this is something I'll be talking about later in the talk — the data or example selection in active learning is based on the particular form of prediction model that you're considering. Because data selection is based on the model specification, if there is misspecification — when you have inductive bias in your system, for example you're choosing to use linear classifiers, and those linear classifiers are not very accurate models for the problem at hand — you can really run into problems in active learning, because you're trying to leverage that structure so heavily. So I've put up a famous quote everybody has probably heard, "essentially, all models are wrong, but some are useful," by George Box, who was a professor here at Madison, and what I'm saying is that this problem of models being wrong is especially problematic in the setting of active learning. [00:12:15] Ok, so maybe I'll just pause there. I don't know if any clarifying questions have come up in the chat — I could take them, or I can just keep going. Good, I'll continue, and again I'm happy to take questions as we go or at the end. So to understand these points, maybe it's good to start off in a very, very simple setting — the simplest, perhaps — and that is learning a one-dimensional binary classifier. What I'm showing you here are examples in a one-dimensional feature space, and let's suppose those are the labeled examples. Now we have labeled data, and if I ask you where the decision boundary should go, we can just see it: right there, right in between the red samples and the blue ones. [00:13:05] So how do passive and active learning compare in this very simple setting? Well, the active method that you would naturally consider here is something like binary search, and that's what I'm showing over on the right; I'm comparing it to just randomly selecting examples for labeling on the left. What you can see is that after a small number of samples, binary search has really localized where that decision boundary should be, whereas random passive sampling has a lot of uncertainty left about where that decision boundary should really be placed. So in this setting — again, very simple — we can see that binary search quickly finds the decision boundary. If we have n labeled examples, the error rate of passive learning will decay like one over the number of labeled examples, whereas in active learning it decays exponentially with the number of labeled samples. So there's this huge exponential speedup in active learning compared to passive learning, and that's something we'll see holds more generally in certain cases.
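A minimal sketch of this one-dimensional comparison — a noiseless threshold classifier learned either from randomly labeled points or by binary search over the sorted pool. The threshold value, the function names, and the noiseless-label assumption are illustrative choices, not taken from the talk.

```python
import numpy as np

def label(x, threshold=0.37):              # hypothetical true decision boundary
    return 1 if x >= threshold else -1

def random_queries(xs, n):
    """Passive learning: label n points at random; the interval known to
    contain the boundary shrinks roughly like 1/n."""
    picks = np.random.choice(xs, size=n, replace=False)
    lo = max([x for x in picks if label(x) < 0], default=xs.min())
    hi = min([x for x in picks if label(x) > 0], default=xs.max())
    return lo, hi                           # boundary lies somewhere in [lo, hi]

def binary_search_queries(xs, n):
    """Active learning: halve the candidate interval with each label, so the
    uncertainty shrinks exponentially in n."""
    xs = np.sort(xs)
    lo, hi = 0, len(xs) - 1
    for _ in range(n):
        mid = (lo + hi) // 2
        if label(xs[mid]) < 0:
            lo = mid
        else:
            hi = mid
    return xs[lo], xs[hi]

xs = np.random.uniform(0, 1, 1000)
print(random_queries(xs, 10), binary_search_queries(xs, 10))
```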
[00:14:16] To generalize this idea of binary search to higher-dimensional spaces and other function spaces, there's a nice idea called disagreement-based active learning, and I'm going to illustrate it in the context of learning linear classifiers. To begin with, we're going to look at a situation where our training examples are uniformly distributed in the unit ball, and we're going to look at linear classifiers passing through the origin. A linear classifier passing through the origin just cuts the ball into two hemispheres, and we don't know which linear classifier is right — it could be any orientation. So here's the idea: suppose that we had some labeled examples like the ones I'm showing you here. With those labeled examples you could ask, which of all the possible linear classifiers, or orientations, are consistent with these data? Those would be linear classifiers like the ones I'm showing: based on these data, we could accept any one of them as a good separator of the labeled examples we've seen so far.
[00:15:31] Then we can use that to partition the data space into two components: a component where all of those classifiers agree — that's the grey region you see here — and a component of the data space where there is some disagreement among the classifiers, which we call the region of disagreement; that's the green region. What it means is that one or more classifiers disagree with the others in the green region, so the green region is where we might get some new information to refine what we understand about the set of classifiers that might be good for this particular problem. So the idea is, we get to draw another set of samples — those black dots — and we have to decide which of these we want labeled. Labeling examples in the grey region doesn't help us at all, because we already know how the true classifier labels that region: all the classifiers that are consistent with the data agree there. So we don't bother with those; we only label examples that are in the region of disagreement — these two, in this example — and then we refine what we understand about the region of disagreement, and we have a new region that's smaller than before. The idea of disagreement-based active learning is to iterate this process to efficiently learn the classifier, and by efficiently I mean by labeling a small number of examples: rather than just picking examples at random to label, we only ask for labels on ones that will help us refine this region of disagreement. [00:17:03] Now, the upshot of all this is that there are ways to design classifiers so that, if you want to find a classifier whose error rate is within epsilon of the optimal linear classifier — I'll call that epsilon-optimal — passive learning would require on the order of d over epsilon labeled examples, d being the dimension of the feature space and epsilon being the desired accuracy level. Active learning, in contrast, is also linear in d but only logarithmic in one over epsilon. So we have log(1/epsilon) instead of 1/epsilon, and that's the exponential gain that you can get through active learning in a situation like this. [00:17:45] The challenge is that I've drawn for you what the region of disagreement looks like in this very special setting of linear classifiers passing through the origin in a two-dimensional feature space. More generally, the shape of the region of disagreement is much more complicated — more complicated in higher dimensions, and in fact it's even more complicated in two dimensions if we eliminate the restriction that classifiers pass through the origin. So let's look at a general linear classifier. With the data set I'm showing now, clearly any linear separator of this data set will not pass through the origin of the unit ball, and we can start asking what sort of linear classifiers would separate it. Well, here's one classifier — this would be what you would consider the max-margin linear classifier.
[00:18:35] This is another classifier that separates this data, here's another one, and another one, and so forth. So all of these are possible classifiers that separate the labeled examples we've seen so far in this case, and this is what the region of disagreement will look like. It's much more complicated — it doesn't really have this nice cone-like structure — and this is why, in the general case, computing the region of disagreement is going to be computationally expensive; it's sort of a combinatorial object now that you need to get your hands on, and just sampling everywhere in this region of disagreement might even be inefficient in high-dimensional settings, and I'll explain why that is in a minute. So the approach that we're going to discuss is the idea of approximating this region of disagreement: it's a complicated set, so let's approximate it with a much simpler set, and the form of the simpler set that we're going to look at is a spherical segment. I'm showing you, in the large circle, a spherical segment in 2D, and down in the lower right a spherical segment in 3D, and this idea continues to higher dimensions. So that's the idea, and to understand it, let's look at the special case of homogeneous linear classifiers. This is the case we were talking about just a minute ago, where the classifiers must all pass through the origin, so there's no offset or bias in the classifiers. This special case was studied by Nina Balcan and collaborators — Nina was a professor at Georgia Tech for a while, so probably many of you know her — and this is really beautiful work that exposes a super interesting phenomenon relevant for learning linear classifiers, and that is that high-dimensional balls have most of their volume near the equator. That's a fact of high-dimensional probability theory, and it's what sort of drives what Nina did in her work. So here is the idea.
[00:20:32] The problem is that if we just over-bound this region of disagreement with a spherical segment, it could be very wasteful, and let me show you what I mean. Here is an over-bounding approximation to that region of disagreement: you can see it completely covers the green region of disagreement, but it also covers a lot of area where we know all the classifiers agree. So we have a lot of volume — in fact an enormous amount of volume, near the equator — that's being sampled, and these are wasted samples: we don't need the labels for those examples, but we're collecting them anyway because we're using this over-bounding approximation. In high dimensions most of the volume is close to the equator, which means that a huge fraction, maybe the majority, of the labels that we get are on examples where we already know the label. What Nina did — and this is the really key idea in her papers — she said, we need to be more aggressive about how we approximate; we need to localize things a little bit more to reduce this wasteful sampling. So the idea is to use a spherical segment again, but now one that doesn't completely cover the region of disagreement, and what's done is essentially a balancing act: get fewer wasted examples near the equator, while at the same time not sacrificing too many of the helpful examples in the part of the region of disagreement that's not covered by the segment. The upshot of this more aggressive localization is that instead of having a sample complexity — a label complexity — that grows like d to the 3/2, Nina was able to show that she can get the information-theoretically optimal dependence of just linear in dimension using this aggressive procedure. [00:22:26] Ok. So I just want to reiterate that everything looks pretty simple here in two dimensions — we have the nice cones when we require the classifiers to pass through the origin — but these regions of disagreement are much more complicated in general, and that's why we have to use these approximation techniques. Let's go back to the general linear classifier. This is something that wasn't covered in Nina's work, and it's important in practice for a couple of reasons. First of all, in general when we train linear classifiers we usually don't insist that they have no bias term; this bias term b that I'm showing you there offsets the classifier from the origin, and that's important in general. Another thing that we can study when we look at more general linear classifiers is class imbalance: classes are often very imbalanced. In the setting we were just looking at before, with the classifiers going through the origin, the two classes were perfectly balanced — one class was in one hemisphere, the other was in the other hemisphere.
[00:23:31] That is, the classes were balanced, meaning you have an equal number of examples from both classes. But because of class imbalance, which happens a lot — for example, there are many more healthy people than sick ones — we need to be able to handle that kind of bias in the prior probabilities of the two classes, and this more general form of linear classifier, the model we're looking at, will help us do that. The challenges that come up in this more general setting are that the region of disagreement can be much more complicated, even in two dimensions, as we talked about, and, more specifically, that spherical caps of high-dimensional spheres are much smaller in volume than segments that pass through the equator. Because we're offset from the equator now, everything gets even more complicated than it was before. [00:24:21] We need to figure out a good approximation to the region of disagreement in the general linear classifier setting, and that's more complicated than it was when we were looking at homogeneous classifiers that pass through the origin. So this is something that Mina Karzand and I started to look at: how do we approximate the region of disagreement in this general setting with a spherical segment? I'm showing you the general setting and the spherical segments, and the problem, again, is that there's exponentially less volume in this spherical cap. What that means is that if we just use a spherical segment that somehow covers a portion of the region of disagreement, and we're not careful, that segment can disproportionately favor the majority class — in other words, most of the examples we ask labels for will come from the majority class, because of the class imbalance and also because the volume decreases so quickly as we move away from the equator. We could also potentially be wasting a huge number of samples by sampling in regions near the equator that aren't part of the region of disagreement. So what we need to do is somehow design a spherical segment that is as far from the equator as we can make it, to better balance the classes and reduce this wasted sampling, and that's a lot trickier than it was in the homogeneous case. [00:25:59] In the homogeneous case, it turns out it really doesn't matter how you orient the segment, as long as it's oriented in a way that passes through the origin and covers a good fraction of the region of disagreement — there wasn't so much sensitivity there — but it becomes much more challenging and problematic when we're not looking at the homogeneous case but the general one.
So here is the solution that we came up with. We're going to construct a spherical segment based on the data-consistent classifier that is closest to the origin, and I'll show you what that looks like — it's this one here. That's a linear classifier that's pushed as close to the origin of the unit ball as we can make it while still being consistent with the data we've seen, and what it does is separate the unit ball into two pieces. We call this the max-min volume separator, because it has the effect of maximizing the minimum volume of the two pieces; in this case the minimum-volume piece is the upper part, which is a spherical cap. It turns out that's the right prescription — or at least it's a prescription that works. We use this to define the spherical segment, and it then better balances the sampling between the two classes and reduces wasted sampling outside of the region of disagreement, because even though it looks like there's a large portion of the segment that's not overlapping or intersecting with the region of disagreement, that portion — which is sort of over on the right here — is away from the equator, so even though it looks large here, in high dimensions it's a relatively tiny volume. So basing our spherical segment approximation on this max-min volume separator, and sampling in that region, is the key idea. And if we assume that our learning algorithm is initialized with at least one example from each class — that just avoids an unnecessary search to find at least one example from each class at the beginning — then, with that initialization, what we can show is that we can learn an epsilon-optimal classifier with the optimal sample complexity, both in terms of the dimension and in terms of the accuracy epsilon. [00:28:05] And so that's sort of what we know about linear classifiers. I would say that, by and large, a lot of the theory and algorithms for active learning of linear classifiers are pretty well developed at this point. There's one open problem I'll just leave you with, and we can talk about it afterwards if people like, and that is efficient methods for finding this max-min volume separator: as it's naturally posed, it's not really a nice convex optimization. I think there are good ways to solve it — I've talked with some people about it. [00:28:40] But I don't know yet what's the most efficient way to find these max-min separators that we use for this construction.
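One way to write down the max-min volume separator just described — this is my formalization of "the data-consistent hyperplane pushed as close to the origin as possible," not notation taken from the slides. For labeled data $(x_i, y_i)$ in the unit ball, solve

$$
\min_{w \in \mathbb{R}^d,\; b \in \mathbb{R}} \; |b|
\quad \text{subject to} \quad
y_i \,\big(w^\top x_i + b\big) \ge 0 \ \ \text{for all } i,
\qquad \|w\|_2 = 1 .
$$

Since the smaller of the two pieces that the hyperplane $\{x : w^\top x + b = 0\}$ cuts from the unit ball has volume that decreases as $|b|$ grows, minimizing $|b|$ over consistent hyperplanes is the same as maximizing the minimum of the two volumes; the unit-norm constraint is what keeps this from being a nice convex program, consistent with the open problem just mentioned.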
That's what I wanted to say about linear classifiers. Maybe we do a quick break — any questions now? — or I can move on to the next section of my talk. Ok, I will continue. [00:29:07] So, linear classifiers are good: they're easy to understand and easy to interpret, but they're certainly often not what people are using in practice, and there can be special problems with using simple models like linear classifiers in active learning, because these models can really break down the active learning process. I'm showing here an example where we have two features in a feature space and a linear classifier — that's the best linear classifier we could find in this setting, since there is no linear classifier consistent with all the labels — and here is a suboptimal linear classifier. [00:29:46] This one will have a larger error than the other one. And so the problem is that when we fix ourselves, or force our active learning algorithm, to use a particular model like a linear classifier, that introduces a bias into the whole procedure, and standard active learning algorithms might perform no better than passive learning because of this, because the model might not match reality. They can even — if we're being aggressive, like we were just talking about with these approximations to the region of disagreement and so forth — converge to suboptimal solutions like what I'm showing on the left. So that's a problem, and the basic point is that active learning suffers when models are misspecified. So the key is to move to a more nonparametric kind of setting, where we can use a nonparametric model that can learn arbitrary kinds of decision boundaries. I'm showing you here the same set of examples: no linear classifier can classify these examples correctly, but something nonlinear can, as I'm showing on the right. So maybe we can try to develop active learning methods for more flexible models like this, and there has been work on this by myself and others. Basically the past work has been theoretical; it indicates that there is some potential for active learning in nonparametric settings, but no practical methods came out of that work. What I'm going to talk about in the rest of the talk is how we can develop very practical active learning methods for nonparametric models, and in particular — because this is work with Mina — we're asking: can we develop active learning algorithms for popular modern classification methods like kernel methods and neural networks? We're going to focus particularly on single hidden layer neural networks and kernel classifiers, which both have this sort of linear-nonlinear-linear type of structure, and we have some nice results for how to do active learning in these settings. These are completely flexible: they can handle essentially arbitrary classification problems, as long as we have a wide enough neural network or a rich enough kernel and reproducing kernel Hilbert space. [00:32:06] Ok, so how do we get started thinking about this? I think the right perspective is one that's been popularized lately, and that is to look at generalization from a function space point of view. What I'm showing in this picture is the classical bias-variance trade-off curve.
[00:32:24] On the vertical axis we have error, and on the horizontal axis we have model complexity, or the number of parameters in our model. The usual story is that if we try to fit our model perfectly to the data, by giving it enough parameters or degrees of freedom, we can get the training error to be very small, but we might suffer a very large test error — in other words, we're overfitting the data. And so the classical story — regularization and so forth — tells us how to balance this and hopefully find the bottom of that test error curve, something that gives us the lowest test error while not perfectly fitting the training data. But that's not what people are doing with deep neural networks and other methods these days: they really just train to near-zero training error. A lot of people have been studying this, and there's this nice work by Mikhail Belkin and his colleagues which popularized the idea of double descent. This is how it might look if we thought about neural networks with a certain number of parameters: as we increase the number of parameters beyond the number of training points — the so-called interpolation threshold — we'll actually see, sometimes or often, that the generalization error decreases as we go further and further into overparameterization. What's really going on here — the key idea — is that we shouldn't be obsessing about the number of parameters; rather, we should be focusing on the function space associated with the overparameterized models and think about minimizing the norm of those functions subject to a data fit. If we do that, we can have a huge number of parameters as long as we control the complexity of the resulting function by considering its norm in whatever function space it lives in, subject to fitting the data — in other words, driving the training error to zero. [00:34:15] And so this is the motivation for our active learning heuristic, which we call max-min active learning. Let me explain it — this is kind of an important slide, so I'll go slowly to make sure everybody gets the idea. u-star is going to be the unlabeled example that the machine selects to ask for the next label. So what is this optimization doing? The set U is the set of all possible unlabeled examples it could select from, and what I have inside this min is the norm of the new function that we learn if we incorporate the example u. Now, if we add u to the training set, we don't know whether it should be labeled plus or minus one, so we consider two possible new learned models: the model learned by adding u assuming it has a negative label, and the new function learned by adding u assuming it has a positive label. Then we look at the norms of those two newly fit functions, we take the min of those two, and then we max over all candidate samples. So essentially what we're doing here is labeling the example that maximizes the norm of what you might consider the optimistic new function learned by adding it to the training set, and by optimistic I mean: if we add this example, how hard is it — how much do we have to change our prediction function — to accommodate that new labeled example? We can measure how much it changes by the norm, and we're optimistically hoping that our guess at the label — which would be the min of these two — is correct, so that we don't have to do a lot of work in changing our function to accommodate the new sample.
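Written out, the criterion is

$$
u^\star \;=\; \arg\max_{u \in U} \; \min_{y \in \{-1,+1\}} \big\| \hat f_{D \cup \{(u,\,y)\}} \big\| ,
$$

where $D$ is the current labeled set and $\hat f_{D \cup \{(u,y)\}}$ is the minimum-norm function fitting $D$ plus the candidate $(u, y)$. For the kernel case, here is a minimal sketch that uses the fact that the minimum-RKHS-norm interpolant of data $(X, y)$ has squared norm $y^\top K^{-1} y$; the Laplace kernel, the bandwidth, and the small ridge added for numerical stability are my choices for illustration, not parameters from the talk.

```python
import numpy as np

def laplace_kernel(A, B, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||), the kernel featured in the talk's 1D theory.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.sqrt(np.maximum(d2, 0.0)))

def min_norm_sqnorm(X, y, gamma=1.0, ridge=1e-9):
    # Squared RKHS norm of the minimum-norm interpolant: y^T K^{-1} y.
    K = laplace_kernel(X, X, gamma) + ridge * np.eye(len(X))
    return float(y @ np.linalg.solve(K, y))

def maxmin_query(X_lab, y_lab, X_pool, gamma=1.0):
    """Max-min active learning: for each candidate u, take the smaller
    ("optimistic") of the two interpolant norms obtained by guessing its label
    as -1 or +1, then query the candidate whose optimistic norm is largest."""
    best_j, best_score = -1, -np.inf
    for j in range(len(X_pool)):
        X_aug = np.vstack([X_lab, X_pool[j:j + 1]])
        score = min(
            min_norm_sqnorm(X_aug, np.append(y_lab, s), gamma)
            for s in (-1.0, 1.0)
        )
        if score > best_score:
            best_j, best_score = j, score
    return best_j
```

The neural network version is the same criterion with the interpolant norm replaced by the trained network's weight norm, which means retraining per candidate — the computational issue that comes up later in the Q&A.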
[00:35:58] So the intuition is that we attack the most challenging samples first — that's what this max part is doing — eliminating the need to label other, easier examples later in the process. In theory, what we can show is that this automatically selects examples near the current decision boundary and close to oppositely labeled samples. Let me try to illustrate this with a simple example. Let's go back to a one-dimensional setting. What I'm showing you here on the right is a piecewise constant function: the function takes value plus one or minus one, and you can see there are three change points in this function — these are essentially the locations of the decision boundaries in this classification problem. The red x's are some initial labeled points, that's where we have initially sampled, and I'm just going to run a little video that shows the progression of where this max-min criterion selects as we go. You can see what happens: the process automatically hones in on and focuses on the locations of the decision boundaries. I'm looping over the process a few times so you can watch it, but it automatically finds and localizes these decision boundaries, so it's not sampling far away from them too often. [00:37:24] What we can actually prove is that if we have n points uniformly distributed in the unit interval, labeled according to a piecewise constant binary function with k pieces, and we run this max-min criterion for active learning using either a Laplace kernel or a rectified linear unit neural network, then the active learner will perfectly predict all n labels after labeling only about k log n of the examples. So this is recovering the optimal binary search, or bisection, performance — actually a little better than that, because we don't even need to know how many pieces there are: k is not a known parameter of the target function we're trying to learn, and the method automatically adapts to it. It recovers binary search and kind of goes beyond binary search, automatically. The beauty is that we can apply this criterion in multiple dimensions; it was just an illustration here in one dimension. So here's how it can look in multiple dimensions — in 2D, so we can visualize it. What I'm showing on the left are the interpolators of the data: [00:38:35] they're fit to a few labeled data points, on the top for a reproducing kernel Hilbert space with a Laplace kernel and on the bottom for a neural network that's been trained to fit those points. The color you're seeing in these images is the absolute value of those functions; the dark blue is zero, so that would be the zero level set, or the current decision boundary, of those two methods — and you can see they are similar for the kernel method and the neural network. So those dark blue portions on the left indicate the current decision boundary. [00:39:11] If we apply this max-min criterion to decide where to sample next, the heat maps just to the right show where the criterion is telling the algorithm to ask for a labeled example next: the bright yellow is where it will sample next, and you see that in both cases it's asking for samples that are in between the closest oppositely labeled pair of examples in the current training set.
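Continuing the sketch above (same imports and helper functions), here is a quick one-dimensional usage example in the spirit of the video: a single hypothetical change point at 0.37 and a made-up kernel bandwidth, with the queries expected to concentrate near the change point as described.

```python
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 200))[:, None]
y = np.where(X[:, 0] < 0.37, -1.0, 1.0)       # one (hypothetical) change point

labeled = [0, 199]                             # one seed example from each class
for _ in range(8):
    pool = [i for i in range(len(X)) if i not in labeled]
    j = maxmin_query(X[labeled], y[labeled], X[pool], gamma=20.0)
    labeled.append(pool[j])

print(np.round(np.sort(X[labeled, 0]), 3))     # queries should cluster near 0.37
```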
[00:39:39] So it recovers this kind of bisection-like performance in multiple dimensions, and here's a little video illustrating it. I have two regions in a two-dimensional feature space, with examples uniformly distributed over the feature space, labeled minus one on the right and plus one on the left, with this sort of sinusoidal decision boundary. What happens is that the initial [00:40:07] part of the algorithm just starts by randomly sampling until it gets oppositely labeled samples, then it kind of does a binary search, and then it sort of stitches along, automatically tracking this decision boundary very efficiently. So in multiple dimensions we have a multi-dimensional version of binary search going on. I'll just show one more experiment to show how it looks in practice. This is using a Laplace kernel for the task of classifying whether a handwritten digit is from 0 to 4 or from 5 to 9 — so we reduced it to a binary classification task of that form. We're comparing the train and test error for random (passive) learning, for the max-min criterion active learning, and for a data-based version of max-min, which I won't have time to go into, but it's just designed to do a little better at early stages of the process, when you have very few labeled examples. [00:41:08] What you can see here is that, both in terms of training and test error, the max-min criterion active learning is doing much better at reducing the error quickly, so we can train more quickly and get better test errors. In this case we're seeing about a 3 to 4x improvement in the number of labeled examples that we need in order to design a good kernel-based classifier. And I should just mention that the theory and some of the stuff I showed before talked about exponential improvements; in practice, real life is more complicated than those idealized settings, and getting these kinds of 3 to 4x improvements is what we often see in real-world problems. That's still very substantial, because it means a big reduction in the amount of human effort required to create these systems. [00:42:06] What goes into proving something about this kind of bisection behavior? It comes down to looking at the structure of the underlying function spaces. In the kernel case, the representer theorem shows us that the optimal interpolator is a superposition of kernel representers, and I'll go over this idea of representer theorems in detail in a minute.
[00:42:30] The kernel matrices associated with that Laplace kernel have a special block-diagonal-like structure in this case, and we can use that to show that the increase in the kernel norm is maximal at the point between oppositely labeled points, so that's exactly what gives us this bisection performance. In the case of rectified linear unit neural networks, things are a good deal more complicated. We actually use a new neural network representer theorem, which I'll tell you about, and it shows that in one dimension the optimal neural network is equivalent to a classical linear spline. It turns out that the neural network weight norm corresponds to the total variation norm of that spline function, and the total variation norm increases the most when a new example is at the midpoint between the closest oppositely labeled examples — which, again, is what produces this bisection performance. So we understand this very well in these two cases, and the key to understanding both of them is looking at things through the lens of representer theorems. So in the last part of the talk I'm going to talk about representer theorems, and specifically about these new representer theorems for neural networks. Ok, so the classical representer theorem is about showing that solutions to certain learning problems in infinite-dimensional spaces can be expressed in terms of finite-dimensional parametric functions, and this goes back to the classic work of Grace Wahba, a colleague of mine here at the University of Wisconsin, and her famous papers on smoothing splines. [00:44:03] Since the 1970s, reproducing kernel Hilbert space representer theorems have had incredible impact; they became widespread in machine learning with kernel learning methods, which really kicked off in the early nineties with support vector machines, and they're an active area of research today. More recently people are looking beyond just Hilbert spaces to Banach spaces and things like TV regularization and related formulations, and our work on neural network representer theorems falls in that more modern category. So, just to recall, here is the classical representer theorem for a reproducing kernel Hilbert space F, equipped with some norm. For any training data set, any loss function, and any function minimizing this regularized empirical risk — we're minimizing the sum of losses on our training data plus a term which is a weighted version of the RKHS norm of our prediction function — the solutions admit representations of the following form: f is a linear combination of the kernel function centered at each of the training examples, and the alpha_i are the dual variables that you're probably familiar with if you've looked at RKHS representer theorems.
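In symbols — writing the regularizer with the squared norm, as is standard — the statement is: for an RKHS $\mathcal{H}_K$ with kernel $K$, data $\{(x_i, y_i)\}_{i=1}^n$, any loss $\ell$, and $\lambda > 0$, every minimizer

$$
\hat f \;\in\; \arg\min_{f \in \mathcal{H}_K} \; \sum_{i=1}^{n} \ell\big(y_i, f(x_i)\big) \;+\; \lambda\, \|f\|_{\mathcal{H}_K}^2
$$

admits a representation of the form

$$
\hat f(x) \;=\; \sum_{i=1}^{n} \alpha_i \, K(x, x_i) .
$$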
[00:46:13] The neural network represent at the arm says that and I'll tell you what this panic space and Norm are in a minute but there's a bar space a norm budget for any training set and any come back to car or worse of lost function l. there exists a solution to this infinite dimensional space optimization learning problem that has a representation which is exactly in the form of a single hidden layer neural network so that's what I'm showing down at the bottom there where the fee function is the activation function and Katie is that number of neurons and the number of neurons needed in such a solution can be less than or equal to in the number of created points so it's very analogous to what we just opt for r.k. chestnuts doing here is a kind of just spent the last few minutes diving into little bits What are the Onyx bases and lots of connections in our own networks so the bonnet space has a set of functions mapping r d r that I have the following file and it norm so that we're looking at the function after then we apply in array to love plus you operator this is a differentiation stuff and then a filter rod and transform And so what this will quasi in is doing is measuring the roughness of the function if you will and we're looking at the roughness of the function in the rod and that means and I'll explain why that's the appropriate domain to consider that roughness or irregularity of functions. [00:47:40] What we show in the representative then is that if we try to minimize this norm subject to data fitting the solution is. That mark and the activation functions are include the re to function so you know the function spaces are index futures now and we post one corresponds to the case where we just apply one operation at the laplacian operator it turns up and the solutions have this warm up rectified and then your you know neural networks if I increase sensually the activation functions are higher power versions of truncate power functions by drought a cubic trick spectra and I'm I'm showing you what those look like in the bottom right corner so. [00:48:24] I want to see a little bit about the function arm and waiting arms and generalization powers so if I have a neural net worth a single hidden layer mark up that form that I just described the space norm is exactly equal to it never path an arm and this is one of the key connections between the neural network weights and the function space representation Moreover we can show that the bar space in arm essentially control the generalization here so if I have I say minimized the data fitting subject to forcing the function in that on a space to happen or less than or equal to some value of capital p. then the generalization here is proportional to that that beyond the norm so why the rod a transfer so I think comes to a couple of minutes to explain this and then we can stop for questions so the key ideas that the neural network neurons are what are called Bridge functions so here I'm sure the right fight when your unit activation function and if I look at that with that generates then with the 2 dimensional inputs that neuron it gives me a ridge light function I'm showing on the right there. 
[00:49:35] Ridge functions are parameterized by two things: their orientation, or angle, and their offset. So looking at this picture — I'm viewing that same ridge function from the top down — here the orientation is zero, I can rotate it, here it is at orientation three-quarters pi, and then I can translate it, which is what the bias, or offset, is doing. So those two things, together with the functional form — the piecewise linear form here is dictated by the activation function — are what the weights and biases of the neural network are telling you: the orientation and the offset. A neural network represents a multi-dimensional function as a sum of ridge functions at different orientations and offsets, and the Radon transform is the key tool for analyzing this: what it basically does is look at a multidimensional object through its projections at different orientations, so one coordinate is the orientation and the other coordinate in the projection is the offset. This is an idea that goes back a long time: Radon transforms and ridge functions were paired up for analyzing plane waves back in the fifties, the term "ridge function" was coined in the context of tomography back in the seventies, Radon transforms are central to the analysis of ridge functions, which were popularized in the nineties, and more recently Greg Ongie and his collaborators made the connection between the Radon transform and single hidden layer ReLU networks. The main thing we bring to the table in our work is that, rather than just focusing on how well we can approximate functions with ridge functions, we look at exactly what kind of functions are learned by fitting neural networks to training data — that's the learning-from-data version of the question, as opposed to the approximation-theoretic question. [00:51:22] Here are a few pictures to illustrate what's going on with this norm and the Radon transform. So here's a ReLU neuron — it's a special type of ridge function. If we apply the Laplacian operator, which is basically taking a second derivative, we get just a line, and that line is at the orientation of the activation boundary of that ReLU neuron. Then, after we apply the filtered Radon transform, which basically re-expresses things in terms of orientation and offset, we get an impulse exactly at the orientation and offset of the original ReLU neuron. So these two operations sparsify the neuron: a single neuron gives us a single Dirac delta in the Radon domain. More generally, the differentiation — the iterated Laplacian — annihilates the polynomial surfaces, leaving only the linear boundaries at the activation thresholds; the filtered Radon transform then extracts their orientations and offsets; and finally the L1-type norm is basically measuring sparsity in the Radon domain. So here's how it looks for a seven-neuron network with rectified linear units, which I'm showing at the left. If you apply the Laplacian, you get the black lines I'm showing in the middle — those are the activation boundaries of each of the neurons — and if I then take the filtered Radon transform, I get the map you see on the right, which shows impulses exactly at the offsets and orientations of all of those seven neurons.
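To collect the last few slides in symbols — informally, with normalizing constants, the even/odd symmetry of the Radon domain, and the low-order (affine) terms in the null space of the seminorm all suppressed. Writing the ReLU-case norm in the Radon-domain form (equivalent to the Laplacian-then-filtered-Radon description above by the standard intertwining identity $\mathscr{R}\,\Delta = \partial_t^2\,\mathscr{R}$):

$$
\|f\|_{\mathcal{R}} \;=\; \big\| \partial_t^{2}\, \Lambda^{d-1}\, \mathscr{R} f \big\|_{\mathcal{M}} ,
$$

and the neural network representer theorem says that, for any data and any convex lower semicontinuous loss $\ell$, the problem

$$
\min_{f} \; \sum_{i=1}^{N} \ell\big(y_i, f(x_i)\big) \;+\; \lambda\, \|f\|_{\mathcal{R}}
$$

has a solution of single-hidden-layer form

$$
\hat f(x) \;=\; \sum_{k=1}^{K} v_k \,\big(w_k^\top x - b_k\big)_+ , \qquad K \le N,
$$

whose seminorm coincides with the weight-based path norm

$$
\|\hat f\|_{\mathcal{R}} \;=\; \sum_{k=1}^{K} |v_k|\,\|w_k\|_2 ,
$$

which is the quantity the generalization bound is stated in terms of. The impulse picture from the slide is the reason: the differentiation step reduces each ReLU ridge neuron to a singular measure on its activation hyperplane, the filtered Radon transform turns that into a point mass at the neuron's orientation and offset, and the measure norm then just adds up $|v_k|\,\|w_k\|_2$ over neurons.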
[00:52:56] Right, so I think I'm just going to pass over this, because we're running a little short on time. I'm happy to talk about what goes into proving that representer theorem, but the key is that it goes through the lens of Green's functions and measure theory and a classical measure recovery problem that goes back to the forties; you can put all that together in this context and see that the solutions to these optimization problems are single hidden layer neural networks. So I'll just wrap up with a couple of take-away messages. [00:53:26] We talked about how nonparametric learning problems with this particular Banach space structure have sparse solutions, which are neural networks, and one thing we can do with this is look at what happens when we add a training example — how it factors into the increased norm — and that's the key behind our active learning heuristic. A couple of things I didn't tell you about: this regularization actually connects, in a formal way, to weight decay and SGD, and in the one-dimensional case it reduces to classical polynomial splines. I'll just show you a little example here of fitting six points with a cubic spline and fitting them with a rectified-cubic neural network: using SGD with weight decay exactly recovers the cubic spline, but if I don't use weight decay I can actually learn an interpolator that is more complex than necessary — for example, this one has an extra bump here. So there does seem to be something important going on in this connection between weight decay and these function spaces. [00:54:25] And I'll wrap up there — I won't go over what I already told you — but a few open problems are: finding efficient methods for computing this max-min volume linear separator, efficient methods for the max-min criterion in the nonparametric setting, and maybe trying to extend representer theorems like this to deeper neural network structures. Those are a few papers that relate to what I've covered, so I'll stop there. Thank you.
[00:54:56] >> Yes, we have a little time for questions. This was great — thanks a lot for the wonderful talk, and thanks for sharing. So I guess there are a bunch of questions; let me start with this one. It's a question about how active learning deals with noise: what if the model asks for the label of a noisy data point, or the label it gets is noisy — would that move us away from the optimal solution? I guess this is about how active learning handles label noise.
>> Great, yes. So the question is, what about noise, and that's a great question. Active learning methods can be noise-seeking, in a way, because they rank, or try to measure, how certain or uncertain the current prediction is about an unlabeled example, and one reason
[00:55:49] a prediction might be uncertain is just because there may be a lot of noise in our labeled data, and so that is a problem. There are a lot of different ways people can deal with it. I was really focusing more on the case where we don't have a lot of label noise — the overparameterized linear classifiers and the nonparametric setting I talked about. There are also ways you might make the methods more robust if you believe there is potential for label noise or outliers or whatever in your training data. On the other hand, if we think about a lot of standard tasks, like image classification, there really isn't too much label noise, and we may really want to fit as well as we can to the labeled data we have, and what I was describing suits that kind of scenario very well. [00:56:35] Roughly speaking, you can do a little bit to handle noise: a bit of voting over multiple examples in regions of interest, and then using some sort of localized majority voting to try to mitigate the effects of noise — that's, in essence, one of the things you can do. [00:56:55] Another technique is, rather than just having a single classifier, you might learn multiple classifiers, possibly with the idea that some of the labels you are getting might be mislabeled, and then you could generate sort of an ensemble of classifiers and average your predictions over them; that's another way you could get some tolerance to label noise. How to handle it really depends on the particular setup and the sort of model structure you're using, so I could follow up with that person offline if they want to chat more about it.
>> Great, that's cool, thanks. I guess there are a bunch of questions; depending on how much time we have, I think we can do a selection of them — there's been a great response from the audience. The next question asks: regarding your result about neural networks and the function space representation at the end of your talk, have you considered, or is there a way, to generalize to the case where the representation is a multi-layer neural network instead of a single hidden layer network?
[00:58:02] >> Yeah, so it would be great to be able to generalize this to multi-layer, deep neural networks. So far, I have to admit, if that were easy to do we would have tried to do it — it seems not so easy to generalize this to multi-layer settings. The compositionality that is so powerful in multi-layer neural networks is what makes it more difficult to analyze: there are just a lot of dependencies through the layers, and that makes it non-trivial, possibly impossible, to extend these representer theorems to multi-layer settings. But it's something we're thinking about, and I think it's a great open question. It might not look the same — it might not have the same [00:58:50] characterization that we have for single hidden layer neural networks — but maybe there's something else that would be helpful in the multi-layer setting. It's an open problem.
>> Cool. And there's a question about whether you would use this selective sampling only for training, or also in the wild: when evaluating the learned rule, would you use a typical held-out test or validation set that represents the population? [00:59:16] >> Yes, so usually —
yeah, so that's right: we would do the active learning, the selective sampling, only on the training side, but for testing and validation we would always have a nice unbiased holdout set — that's exactly right, that's what we would normally do. The one thing I will say, though, is that because active learning can give you very large speedups — in some nice settings exponential speedups over passive learning — there are situations where you can learn a very, very good classifier but you'd need an enormous amount of test data to even estimate the test error to the accuracy that it's achieving. In other words, your holdout estimate of the test error goes down like one over the square root of the number of test samples, but the actual [01:00:15] error of your classifier could be going down much faster with active learning, which is a somewhat interesting fact there.
>> Exactly, thanks a lot. The next question — I'm puzzled about the labeling procedure: do we label examples one at a time as they're selected, or do we keep sampling from the pool and label in batches? [01:00:37] >> Yeah, so there are different ways to implement active learning systems. The normal way we would implement them would be in an online fashion: we ask our human, or our crowd, to label some examples, we take those labeled examples and try to learn a quick predictive model, we use that learned predictive model to select which examples we'd like that person or crowd to label next, and we repeat that process in a sequential, online fashion. [01:01:11] One of the practical challenges that comes up is that once a person or crowd delivers some labeled samples, you have to retrain your model, and that takes some time — it might take more than a few hundred milliseconds — so people might be ready to label more examples before you've been able to complete the update of your model. So a lot of the [01:01:37] challenges in implementing these systems deal with the need to cope with the latencies between how quickly you can update the model, how quickly people can label samples, and how quickly you can select new examples for labeling. All of those things are happening in parallel and they all have their own latencies, so that's a challenge that comes up in the actual implementation.
[01:02:02] >> Ok, makes sense. There's a question about the theory: the asker notes that the active learning bound of d times log of one over epsilon still assumes that the distribution is spherical — uniform or isotropic. Is this correct, or does it hold for arbitrary distributions? >> That is correct, yes — the theoretical analyses of the linear classifier case assume uniform or isotropic distributions. The one thing that our new analysis allows us to cope with is situations where we have these inhomogeneous linear classifiers and also potentially large class imbalances. I think that's important because
[01:02:02] Okay, makes sense. There is a question about the theory. It asks about the active learning label-complexity bound that scales like log of one over epsilon: is that still under the assumption that the distribution is spherical, uniform, or isotropic, or does it hold for arbitrary distributions? That's correct, yes: in that case the theoretical analyses of the nearest-neighbor classifier assume uniform or isotropic distributions. What our new analysis allows us to cope with is situations where we have non-uniform distributions with linear classifiers, and also potentially large class imbalances. I think that's important because [01:02:50] these new ideas might also allow us to be a bit more robust to some of the distributional assumptions that underlie the theory, although that is speculative at this point. But there is something intuitively appealing about this max-min volume separator as a basis for deciding where to sample in a general active learning problem. Great, there is actually a related question on this max-min volume approach. The question asks: could maximizing the minimum volume end up favoring the majority class in some class-imbalanced settings, or is it balancing the volume you allocate to both classes, that is, is it mathematically trying to achieve an exact balance of volume? [01:03:42] Well, I'm not entirely sure I got the question; what was the last part? I guess the first part is: could the max-min volume approach end up favoring the majority class? And the second part is: is it balancing the volume you allocate to the two classes, is it mathematically trying to balance the volumes for those classes? Yeah, so that's a great question. First of all, since we don't really know where the two class-conditional distributions are supported within the unit ball that the examples come from, [01:04:28] the region that we sample from, this spherical segment, is just an approximation, and the max-min volume linear separator is just an approximation; we don't really know that that is exactly where the two classes are separated. The idea, though, is that this particular choice, the max-min volume separator, is at least one that gives as much volume as possible to the side of the minority class. What we are actually balancing in our analysis is the tradeoff between [01:05:01] sampling within the region of disagreement among the classifiers, covering as much of that region as possible so we don't miss important parts of it, and not sampling unnecessarily outside of it. Those are the two things we can actually balance: how much we sample within the region of disagreement versus how much we sample unnecessarily outside of it. But roughly speaking, the intuition is that this also helps to balance between the two classes. Okay, great. [01:05:37] I think, given the time, we can take a few more questions. One question is: the whole idea of predicting the possible counterfactual labels of a point sounds computationally expensive, especially if you do it for every data point; is that a concern in terms of runtime or memory? Totally, it is. [01:06:00] For making these things practical, in principle, let's just take the neural network case: we would have to retrain a neural network twice per candidate, and maybe later retrainings could be warm-started from previous solutions, but it is still a lot of retraining, and we would have to do that for every point. [01:06:17] Now, we can obviously reduce this with some heuristics. First of all, we don't have to evaluate all the candidates, only the ones we think might be the most important, so we can cut down there. And then, in terms of retraining, maybe we can use a cheaper surrogate, for example just looking at the norm of the gradient at each of those new points, which involves only a single gradient computation as opposed to fully retraining.
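As a rough illustration of that gradient-norm surrogate, here is a hedged sketch in PyTorch: for each candidate point it imputes each possible label, does a single backward pass at the current parameters, and records the gradient norm, taking the larger of the two as the score. The model, the loss, and the max-over-labels scoring rule are assumptions for illustration; the papers referenced at the end of the talk define the actual selection criteria.

```python
import torch

def gradient_norm_score(model, loss_fn, x, num_classes=2):
    """Cheap proxy for 'how much would the model have to change to fit this point':
    the norm of the loss gradient at the current weights, maximized over the
    point's possible (counterfactual) labels. One backward pass per label,
    instead of retraining the network once per label."""
    scores = []
    for y in range(num_classes):
        model.zero_grad()
        output = model(x.unsqueeze(0))                 # forward pass for the single candidate point
        loss = loss_fn(output, torch.tensor([y]))      # pretend the label were y
        loss.backward()                                # one gradient computation
        grad = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])
        scores.append(grad.norm().item())
    return max(scores)
```

Points with large scores are the ones the current network would have to bend the most to accommodate, which is one natural notion of "informative" and is loosely related to the neural tangent kernel view mentioned next.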
[01:06:41] There are some reasons to think that might be a very reasonable, and maybe even provably good, approach. It connects a little bit to things like neural tangent kernels, and it also relates to sampling methods used in other adaptive settings like multi-armed bandits, linear bandits, and so forth. So there are some good ideas out there for possibly making these methods more computationally feasible. Great, great. I guess we should summarize with a final question, which asks about possible references: all of these are interesting theories and papers related to your presentation today, so where can people find the details? I think you provide some on the last page of your presentation. Yeah, the last page in particular has three papers related to this work, and in those papers there are a lot of references to other pertinent things; I think I also sprinkled a few other references in throughout the talk. But if people [01:07:45] have questions or would like pointers, they are more than welcome to follow up with me, and I am happy to share my slides too; I don't know if that is something you would want, I mean we have the video, but I can send you the slides to email out. Great, yeah, we will follow up on that. One more question: are there any introductory-level tutorials or references on active learning in general, and on how to differentiate active learning from related ideas such as incremental learning? Is there any basic reference on the topic? You know, there is a monograph on active learning written by Burr Settles, [01:08:27] who was a graduate student back in the day here, and I believe I reference it in some of these papers. That is one good overview of what people were thinking maybe eight or ten years ago; I don't know if there is a really great, fully up-to-date survey paper on active learning. [01:08:53] I also have some notes and lecture materials, so if somebody is interested, they are welcome to follow up with me and I can give them pointers to the Settles monograph and to some notes of mine that might be helpful, depending on whether they are interested in the theory or the more practical aspects. [01:09:17] I guess that is pretty much all the questions from here; there are probably more clarification questions, but given the time, about ten minutes past the hour, I guess we should wrap up. Thank you so much, thank you so much again for the wonderful talk and for answering the questions; this is really great research, thank you. Great, yeah, thanks again for the invitation to speak and thanks for all the great questions and for listening in. If people want to follow up with me, my email is also on the first slide, so they can reach me there. Okay, wonderful, thanks. [01:10:03] Okay.