[00:00:05] >> Today we have a speaker visiting us from Brown University. She did her PhD at the University of Pennsylvania, and she has many interesting research interests; she is probably best known for her work on understanding contextualized embeddings, which is what she is going to talk about today, although she started out working on other topics as well. [00:00:43] Her work spans a broad set of related research questions, and she is also leading several large projects, including a collaboration with our group on information retrieval that has been going on for more than a year, so I have gotten to see her and her work up close, and I have really enjoyed it. [00:01:22] So without further ado, let's have her talk about her own work.

>> Thank you so much, Ray, and thanks for the introduction. It's been fun knowing you for so long; it's a shame not to get to visit in person, but I really appreciate the invitation, and it's been fun seeing you so much recently. Thank you all for attending and for hosting me, virtually I guess. [00:01:50] So today I'm going to talk about some work that spans the past couple of years. As was mentioned, I've recently been very interested in contextualized word embeddings and language models, and basically what they do and don't represent about language. This is a somewhat linear [00:02:11] story of work we've been doing since around 2018, building on the question of trying to understand why these models succeed and why they fail in the ways that they do. First, an acknowledgement of my collaborators, in particular Ian Tenney, who I've worked with a lot at Google and who has been the driving author on much of this work, and my students at Brown, Charlie Lovering and others, who are [00:02:36] co-lead authors on a lot of the work, along with several other collaborators.

So over the past two to three years there has been a lot of work trying to understand what these deep language models understand about language, and I like to organize the work that's been done into two broad categories. [00:03:01] One approach has been focused on asking what types of features the representations encode about language; this uses a technique called probing classifiers. A parallel, closely related line of work asks a slightly different question: do models behave as though they are using specific types of features? I think of this technique as challenge tasks. [00:03:30] So, to set the stage for the work I'll talk about later, I want to give an overview of what has been happening on both of these sides of the analysis over the past couple of years. So, on the first question.
So this is really people asking: is some specific type of feature encoded by our language representations? Here we're dealing with neural language models, so if you're not familiar with them, these are basically just big neural nets trained on some version of "predict the missing word" or "predict the next word." You give the model an input sentence and train it by saying: okay, you're about to start a sentence, what do you think the first word should be? [00:04:17] It predicts something like "the," and you say, good job, you predicted the word "the," now what do you think the next word should be? Something like "car." They're just trained in this way, so it's a self-supervised training method, and it turns out to produce very good representations of the words themselves. Very often we don't care about this prediction task per se, although sometimes we do care about generating language; more often we care about the representations we get of the individual words, so the vector for "the," the vector for "car," and so on. [00:04:50] And so the question is: what do these vector representations encode that earlier representations of language didn't necessarily encode?

The technique used most often in this kind of study is, like I mentioned, called probing classifiers, or diagnostic classifiers. The idea is that we take these representations, which have been trained to do the language modeling task, and we freeze the network so that no updates are made to the vectors themselves, and then we put some classifier [00:05:17] on top of those representations. We ask, for example: does the vector for "car" contain the information that "car" is a noun? Or, from the pair of vectors for "car" and "blue" together, can I recover the fact that "blue" is a modifier of "car" in this sentence? This is some work from ICLR 2019 with Ian Tenney and a couple of other collaborators, where we looked at, for lots of different types of linguistic features we care about, how well these representations encode those features, and specifically we looked at the features that come from the traditional NLP pipeline. [00:06:01] This includes, for example, part-of-speech tagging: in a sentence like "Disney is a global brand," we would like to know that "Disney" is a noun, "is" is a verb, "brand" is a noun. In the pre-neural pipeline we would train a model to explicitly produce parts of speech, and that output would get fed into a later stage that might do something like named entity recognition, where we care about the fact that "Disney" refers to an entity in the world, something that could have a Wikipedia page, whereas the word "brand" is a common noun that does not have that kind of entity status. [00:06:35] We do things like parsing, where you would want to know that "Disney" is the subject of the verb "is" and "global brand" is the object of that verb. And we might do things like coreference, where you would want to know that "Disney" and a later "it" refer to the same entity in the sentence. [00:06:55] These are increasingly abstract levels of linguistic annotation, or linguistic processing, that build on each other and served as the basis for a lot of the models we would build back in the day, when these tasks were done explicitly. And so the question is whether these language models, trained in the self-supervised way I just described, actually encode this kind of information, kind of for free.

[00:07:20] So we can look at all of these different tasks that fall along the NLP pipeline, and we can look at how well our model does, in terms of accuracy, if we just start with the word prior. You can see this kind of annoying thing about language data in general, which is that the word priors are very good: if I have a word like "car," I have a pretty good guess that it's going to be a noun, without knowing much about the context in which it occurs. [00:07:56] Similarly, if I see a pair of words, one of which is "blue," I have a pretty good guess that the dependency relation is going to be a modifier relation, because when "blue" occurs it usually occurs as a modifier. So the word priors are quite high, but even so, when we train probes on the contextual representations we get very significant gains, in many cases close to comparable with state-of-the-art models trained to do these tasks directly. It's a little easier to see if you look at the delta over the word prior: in many cases there are double-digit gains in how well the contextualized models encode this kind of information.

[00:08:38] In a follow-up paper we dug a little deeper and looked not just at whether the information is there, but at how it is organized. I'm going to show a fairly complicated figure, so bear with me and I'll work through it. These language models are extremely deep, often many, many layers; in this case we're looking at BERT, specifically the 24-layer BERT-large model. What you're looking at is a figure that goes from left to right, where each column corresponds to one of the layers of the network, and the important thing is that later layers can depend on earlier ones. So if the model is learning increasingly abstract representations, similar to how in computer vision you start with basic edge detectors, then maybe shape detectors, and move up to cat detectors, we would imagine something comparable in language, where you start with lower-level linguistic information and build up toward higher-level information. [00:09:37] The colored bars are learned weights: they're our way of measuring how important a given layer is for making a decision about one of these linguistic features. You can see that across these different tasks, part of speech, constituency parsing, dependency parsing, and so on, the organization follows the pipeline structure we were expecting: part-of-speech information, which is considered lower level, is resolved a bit earlier in the network, whereas things like coreference resolution, which depend on having good part-of-speech tags and good parses, happen much later in the network. We can't claim a causal relationship here yet, but [00:10:21] the signal is consistent with the model doing the right thing: learning low-level linguistic features, using them to form more abstract linguistic features, and eventually arriving at representations that are competitive with models explicitly trained on these tasks.
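To make the probing setup concrete, here is a minimal sketch, in Python with PyTorch, of a probing classifier in the spirit described above: the encoder is frozen and only a small linear classifier is trained to predict a linguistic label (part of speech here) from the representations. The encoder, dimensions, and data below are placeholders I made up so the example is self-contained and runnable; the actual experiments use real pretrained encoders and a richer probing framework rather than this simplification.

```python
import torch
import torch.nn as nn

# Minimal probing-classifier sketch (illustrative only).
# `encoder` stands in for a pre-trained model (e.g. BERT or ELMo); here it is a
# frozen random embedding so the example runs on its own.
vocab_size, hidden_dim, num_pos_tags = 1000, 128, 17

encoder = nn.Embedding(vocab_size, hidden_dim)       # placeholder for a real encoder
for p in encoder.parameters():
    p.requires_grad = False                          # freeze: the probe never updates the encoder

probe = nn.Linear(hidden_dim, num_pos_tags)          # the probing classifier itself
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_probe(token_ids, pos_labels, epochs=10):
    """Train only the linear probe to predict POS tags from frozen vectors."""
    for _ in range(epochs):
        with torch.no_grad():
            reps = encoder(token_ids)                # frozen word vectors
        logits = probe(reps)
        loss = loss_fn(logits.view(-1, num_pos_tags), pos_labels.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()

# Toy usage: random "sentences" and POS labels, just to show the shapes involved.
tokens = torch.randint(0, vocab_size, (32, 12))      # batch of 32 sentences, 12 tokens each
labels = torch.randint(0, num_pos_tags, (32, 12))
print(train_probe(tokens, labels))
```

The key point is the freezing step: probe accuracy then measures what is already encoded in the fixed vectors, not what the encoder could additionally learn to encode.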
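One common way to implement the per-layer weights described above is a "scalar mix": softmax-normalized scalars over the layers, learned jointly with the probe, which can later be inspected to see which layers a task relied on. The sketch below is my own illustrative stand-in under that assumption, with made-up dimensions, not the code used in the paper.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Learn softmax-normalized weights over the layers of a deep encoder.

    After training a probe on top of this mix, inspecting softmax(self.weights)
    shows which layers the probe relied on most -- the kind of per-layer
    importance analysis described in the talk. Illustrative sketch only.
    """
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one scalar per layer
        self.gamma = nn.Parameter(torch.ones(1))               # overall scale

    def forward(self, layer_reps: torch.Tensor) -> torch.Tensor:
        # layer_reps: (num_layers, batch, seq_len, hidden)
        probs = torch.softmax(self.weights, dim=0)
        mixed = (probs.view(-1, 1, 1, 1) * layer_reps).sum(dim=0)
        return self.gamma * mixed

# Toy usage with a fake 24-layer, BERT-large-shaped stack of activations.
num_layers, batch, seq_len, hidden = 24, 8, 16, 32
fake_layers = torch.randn(num_layers, batch, seq_len, hidden)
mix = ScalarMix(num_layers)
print(mix(fake_layers).shape)                  # torch.Size([8, 16, 32])
print(torch.softmax(mix.weights, dim=0)[:3])   # per-layer importance (uniform before training)
```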
[00:10:39] Great, so that was some of our work on this topic, but I want to highlight that we are far from the only people working on this question, or applying probing classifiers to it. Another highly visible result in this area was work by Hewitt and Manning at NAACL 2019: they looked at the same pretrained models and basically trained a probe to project from the embedding space to something like an explicit syntactic parse tree, and showed you could do that with state-of-the-art accuracy. [00:11:12] There was also great work at ICLR the same year looking at the neuron level within the network for evidence that you could differentiate linguistic concepts like verb tense or grammatical gender, and they showed that those are also recoverable with very high accuracy, and there has been a battery of other results of this type. So my summary for this left-hand side is that, over the past couple of years, there has been mounting evidence that these models really do contain a lot of linguistic information about sentences. They don't just contain what we often refer to as a bag-of-words type of representation; they do seem sensitive to quite a bit of non-trivial linguistic structure.

Okay, so that's the first side, but like I said, there has been a lot of work happening in parallel. [00:12:08] At the same time, people have been asking whether models behave as though they have access to these rich types of features. The question there is whether some type of feature, part of speech, dependency structure, and so on, is actually used by models to make decisions. We can think about some of the tasks where these things matter. Imagine a basic inference task: here we want the model to take as input a pair of sentences, like "there are apples and bananas on the table" and "there are apples on the table." [00:12:41] We refer to the first as the premise and the second as the hypothesis, and the task is to predict whether the premise entails the hypothesis, whether the hypothesis logically follows given the premise. In this case, obviously, the answer would be yes; if I make a small change, like "apples or bananas" in the hypothesis, it doesn't necessarily follow. [00:13:03] So this is a study with Tom McCoy and Tal Linzen, who were the lead authors, from Johns Hopkins. We were looking at whether models were using the rich syntactic knowledge they appear to have access to in order to reason about these kinds of inference tasks. We designed sentence pairs where basically all of the words in the hypothesis appear in the premise, [00:13:32] but sometimes the hypothesis is entailed and sometimes it's not: in the first case the hypothesis is entailed, in the second case it's not. [00:14:00] The reasoning behind this design was that if the models are using rich syntactic structure, then they should be able to differentiate between these two cases, for example by reasoning about dependency structure or semantic roles, which we saw earlier they capture fairly well. But if they're not actually using those features, they might fall back on some kind of dumb heuristic, something like lexical overlap, and then they'll predict entailment in both of these cases, despite the fact that there is obviously a structural difference that would allow you to separate the entailed cases from the non-entailed ones.

[00:14:19] And so we evaluated a set of models that were, at the time, state of the art, including the BERT model I was describing before. On a standard test set they perform competitively. But if you test them on examples like the green ones, where all of the words in the hypothesis appear in the premise and the sentence is entailed, they get near 100 percent accuracy, because they always predict entailment; and if you test them on examples like the red ones, they get close to zero accuracy, because, again, they always predict entailment. [00:14:50] This type of result was surprising at the time and has now become almost a platitude: how readily these models appear to make decisions for the wrong reasons, or to use heuristics. So that's one example, using lexical overlap; another pretty high-profile one is work from Rachel Rudinger and colleagues, and there has been a lot of similar work looking at coreference and how models will exploit biases like gender-profession correlations rather than reasoning about the syntax and the antecedents. [00:15:32] Some early work in this area that I really liked, out of Stanford, took a question answering model and just added random distractor sentences into the document from which the answer was supposed to be derived, and this caused models to lose about half of their accuracy, simply from sentences that were unrelated to the question the model was trying to answer. [00:15:56] So overwhelmingly, the results from this challenge-task style of evaluation suggest that models really aren't solving tasks by using these rich features.

This setup has been really fascinating to me: thinking through how these two sets of results coexist and why we're seeing exactly the patterns we're seeing. On the one hand, you can extract this linguistic information from the models, they do appear to have access to it; but they don't seem to use it to solve tasks, even when they really should, when using those features would let them do the task better. So what's going on? [00:16:38] I'm going to talk through three sets of results from some recent work that test different hypotheses about what might account for this disconnect.

The first thing that came to mind when I was brainstorming with some of the other authors is: maybe there's nothing weird going on at all. It could just be that the features that come from pre-training get erased during fine-tuning, that we have some kind of catastrophic forgetting going on. Even though the pretrained language model had access to dependencies and semantic roles, when we fine-tune it to do something like the inference task, it quickly overwrites those features and no longer has access to them. [00:17:31] So we tried to look more closely at whether something like this is happening; these are results from work led by Merchant and colleagues at Google. We looked at quite a lot of things in that paper, and I'm just highlighting a few of the results, in particular whether the linguistic features that we said were present as a result of the pre-training procedure are still there after fine-tuning.
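For illustration, here is the lexical-overlap heuristic itself written out as a few lines of Python, applied to two made-up HANS-style pairs (full word overlap, opposite gold labels). The sentences are my own invented examples in the same spirit; a model that falls back on this shortcut behaves like this function and predicts entailment for both.

```python
def lexical_overlap_predict(premise: str, hypothesis: str) -> str:
    """A deliberately naive baseline: predict entailment iff every word of the
    hypothesis also appears in the premise (the lexical-overlap heuristic)."""
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return "entailment" if hypothesis_words <= premise_words else "non-entailment"

# Two hypothetical pairs: identical lexical overlap, opposite labels.
pairs = [
    ("the doctor near the lawyer saw the actor", "the doctor saw the actor", "entailment"),
    ("the lawyer saw the doctor near the actor", "the actor saw the doctor", "non-entailment"),
]
for premise, hypothesis, gold in pairs:
    pred = lexical_overlap_predict(premise, hypothesis)
    print(f"gold={gold:15s} heuristic_prediction={pred}")
# The heuristic predicts "entailment" both times: right on the first pair and
# wrong on the second, mirroring the near-100% vs near-0% accuracy pattern.
```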
[00:17:54] So this is the same kind of figure I showed earlier, saying that the model contains a lot of signal that is not available in the lexical prior. What we do is take that pretrained model, fine-tune it on a number of tasks, fine-tune it to do inference or question answering or something like that, not a huge number of tasks honestly, and then see whether performance on any of these probing tasks drops. [00:18:27] And what we see is: not really. There is some fluctuation, but it's hard to say whether any of it is outside of random noise. The one exception is that if we fine-tune on the dependency parsing task specifically, constituency parsing improves, which is not something we've followed up on; but for the most part we don't see anything suggesting that the information is really lost, especially not to the extent that the model could no longer use those features for a downstream task if it needed them. [00:19:00] We can also look at this a little differently. That was using the kinds of probing classifiers I mentioned before; I also mentioned the Hewitt and Manning paper, which trains a projection from the embedding space into parse-tree space, and these are results using that type of analysis. I won't get into the weeds on the metrics because they're parsing-specific, so if you're not an NLP person they might not be that interesting, but what we really care about is the difference between the blue line and the other lines. [00:19:42] Here you can see that when we fine-tune on a task that is explicitly syntactic, dependency parsing, we do improve the quality of the encoding of these linguistic features, so it does get better in some cases; but when we fine-tune on question answering or the inference task, we don't see it get worse relative to the baseline model in any significant way. [00:20:02] So my takeaway was that we can't really explain the problem we're seeing as a forgetting issue, because we don't see significant drops in probing accuracy after fine-tuning on the new tasks. Something else has to be going on that explains why the features are not getting used to solve these tasks.

[00:20:28] The second hypothesis, and I think maybe the most intuitive one, is that maybe there just isn't enough signal in the training data to begin with, so the model doesn't know that these are features it should be using; it doesn't know that dependency structure is important for, say, natural language inference. [00:20:50] Like I said, I think this is a very intuitive hypothesis and a good starting point. I showed this result for the lexical overlap study, where we see that models are using the lexical overlap heuristic rather than reasoning about something richer. [00:21:10] I left out the detail that the training data the model had access to looks something like this: only about one percent of the examples in the model's training data contain lexical overlap, and of that one percent, only ten percent, so one tenth of one percent of the training data, had the label "not entailment."
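As a small illustration of the kind of statistic behind those numbers, here is a sketch that counts, for a generic NLI-style dataset, how often full lexical overlap occurs and how the labels are distributed within that slice. The dataset, the overlap test, and the field layout are simplified placeholders of my own, not the actual MNLI preprocessing.

```python
from collections import Counter

def overlap_label_stats(examples):
    """Count how often full lexical overlap co-occurs with each NLI label.

    `examples` is a list of (premise, hypothesis, label) triples, standing in
    for a real training set such as MNLI. Returns the fraction of examples
    with full overlap and the label distribution within that slice.
    """
    overlap_labels = Counter()
    n_overlap = 0
    for premise, hypothesis, label in examples:
        if set(hypothesis.lower().split()) <= set(premise.lower().split()):
            n_overlap += 1
            overlap_labels[label] += 1
    frac_overlap = n_overlap / max(len(examples), 1)
    return frac_overlap, overlap_labels

# Tiny made-up sample, just to show the output format.
sample = [
    ("there are apples and bananas on the table", "there are apples on the table", "entailment"),
    ("the cat sat on the mat", "a dog barked", "neutral"),
    ("the lawyer saw the doctor", "the doctor saw the lawyer", "non-entailment"),
]
print(overlap_label_stats(sample))
```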
So, in order for the model to perform well on those red bars, we're asking it to extrapolate based on one tenth of one percent of its training data. And if we just add additional examples to the training data, blow it up so that now ten percent of the training data has lexical overlap and there's a 50/50 split between the labels within that overlapping subset, then the models perform quite well in both cases. This is a form of data augmentation, which has been increasingly [00:22:05] studied in NLP, especially for issues related to bias and fairness, and more generally for improving models' robustness to, quote, "out-of-distribution" test examples. So we wanted to look more closely at whether this really makes sense as a general solution for improving models, or for making models more robust: if data augmentation is a good solution, it should be the case that we can just [00:22:37] add additional training data and expect models to generalize as we want them to, across the board.

So we came up with a synthetic setup to test this in a way we could reason about more carefully. The general setup is a machine learning problem with things labeled as positive and negative, and there is some target feature that perfectly predicts the label, so anything inside the blue circle has a positive label; and there is some spurious feature that is unrelated to the label, so anything in the red circle may or may not be positive. [00:23:19] The problem, which I think is representative of most machine learning problems these days, is that our training data is skewed in some way: we've sampled a training set where the spurious feature has very high overlap with the target feature, and as a result a model trained on this data could very reasonably conclude that the spurious feature is the thing that predicts the label, and get high accuracy, even if it ignores the target feature. [00:23:47] At test time, what we want is performance on the full space, and in particular we're interested in testing on the cases that decouple the target and the spurious feature: we want the model to still predict correctly when the target occurs without the spurious feature, or the spurious occurs without the target. That's our goal, a model that is robust on those sets despite having been trained on the biased training set. [00:24:17] So the question is whether the rate of co-occurrence alone, which is the assumption behind data augmentation approaches, that is, the rate of co-occurrence of the target and the spurious feature during training, explains the test performance. Because when we do data augmentation, what we're doing is deliberately manipulating this co-occurrence rate, by adding new examples that include either the spurious feature without the target or the target without the spurious, and the assumption is that doing so will improve the model's generalization to these test samples. So that's what we're interested in testing here.
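To make the evaluation side concrete, here is a small sketch of how one might carve held-out data into the two decoupled test slices described above, given boolean detectors for the target and spurious features. The detectors and the toy digit sequences are placeholders of my own, not the actual experimental code.

```python
def decoupled_test_slices(examples, has_target, has_spurious):
    """Split held-out examples into the two decoupled regions described above:
    target-without-spurious and spurious-without-target.

    `examples` is any iterable of inputs; `has_target` / `has_spurious` are
    boolean feature detectors supplied by the experimenter (placeholders here).
    """
    target_only = [x for x in examples if has_target(x) and not has_spurious(x)]
    spurious_only = [x for x in examples if has_spurious(x) and not has_target(x)]
    return target_only, spurious_only

# Toy usage on digit sequences: target = contains 1, spurious = contains 2.
seqs = [[1, 3, 4], [2, 5, 6], [1, 2, 7], [8, 9, 3]]
t_only, s_only = decoupled_test_slices(seqs,
                                        has_target=lambda s: 1 in s,
                                        has_spurious=lambda s: 2 in s)
print(t_only)   # [[1, 3, 4]]  -> a robust model should still say "positive"
print(s_only)   # [[2, 5, 6]]  -> a robust model should still say "negative"
```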
[00:24:53] Like I said, we set this up as a toy, simulated task so that we have a lot of control over how things work. Our "sentences" are sequences of numbers, and we train on something like a sentence acceptability task: a sentence is in the language if it contains the target feature, regardless of the spurious feature. We tried a few different types of target features. In one case the target feature is just whether the symbol 1 occurs anywhere in the sequence; in another case, [00:25:29] whether the first two symbols in the sequence match; in another, whether some duplicated pair appears anywhere in the sequence; and in the last, whether the first and last symbols match. In all cases the spurious feature is whether the symbol 2 appears anywhere in the sequence. [00:25:53] We can look at how models perform here, starting with the setting where, in the training distribution, the spurious and target features are perfectly correlated, so there's no basis for the model to know whether the spurious or the target feature is predictive of the label; the training data gives it no incentive to prefer one over the other. [00:26:19] We can look at the error rate in the two cases I mentioned before: the error rate when the spurious feature occurs alone, and the error rate when the target feature occurs alone. What's interesting is that in the first case, when "contains 1" is the target feature, we're actually not getting 100 percent error, despite the fact that [00:26:39] there's no real incentive; the model seems to be making a lot of essentially random guesses here. In all of the other cases we do get 100 percent error: the model has chosen to use the spurious feature to make predictions rather than the target. When we augment the training data just slightly, so that 0.1 percent of the training examples contain the spurious feature without the target feature, [00:27:10] which decouples the two features, then in the "contains 1" case the model solves the test perfectly. This is an interesting setting to point out, because 0.1 percent is what we had in the MNLI natural language inference setting I showed earlier; it's a teeny sliver of the training data, but it is enough for the model to extrapolate and learn the right thing for the "contains 1" feature. In all the other cases the model still completely fails. [00:27:37] If we go up to 10 percent, we start seeing movement on some of the other features, although they're definitely not solved; as we move up to 50 percent, the model nearly solves the task in some of the cases, but in the last case, with the first-last-match target feature, it still gets very high error when the target appears without the spurious feature.
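Here is a sketch of how a toy dataset along these lines could be generated: sequences of digits, a choice of target feature (contains-1, first-two-match, any-duplicate, first-last-match), the contains-2 spurious feature, and a knob controlling what fraction of training examples decouple the two. The exact construction and sampling scheme in the actual experiments may differ; this is only to make the setup concrete, and the names are mine.

```python
import random

# Target features from the toy task (names are mine, definitions as described).
TARGETS = {
    "contains_1":       lambda s: 1 in s,
    "first_two_match":  lambda s: s[0] == s[1],
    "any_duplicate":    lambda s: len(set(s)) < len(s),
    "first_last_match": lambda s: s[0] == s[-1],
}
SPURIOUS = lambda s: 2 in s   # spurious feature: the symbol 2 appears anywhere

def sample_sequence(length=10, vocab=range(10)):
    return [random.choice(list(vocab)) for _ in range(length)]

def make_training_set(target_name, n=10000, decouple_rate=0.0):
    """Sample a biased training set where target and spurious features co-occur,
    except on a `decouple_rate` fraction of counterexamples (the augmentation
    knob discussed above). Rejection sampling; illustrative, not efficient."""
    target = TARGETS[target_name]
    data, n_counter = [], int(n * decouple_rate)
    while len(data) < n:
        s = sample_sequence()
        want_counterexample = len(data) < n_counter
        aligned = target(s) == SPURIOUS(s)           # features agree on this example
        if aligned != want_counterexample:           # keep aligned vs decoupled as needed
            data.append((s, int(target(s))))         # label = target feature, by construction
    random.shuffle(data)
    return data

train = make_training_set("first_last_match", n=1000, decouple_rate=0.001)  # the 0.1% setting
print(sum(label for _, label in train), "positive examples out of", len(train))
```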
[00:28:01] So, to summarize these results: the picture gets a little more complicated. When we think about something like data augmentation, it seems we can't simply blame this on the training data. If it were purely a problem with the training data, then data augmentation should work similarly across all settings, but we actually see that different features respond differently to the same intervention. There is more going on than [00:28:32] just learning which features co-occur during training; there's some additional dimension that affects whether the model will make decisions based on the spurious feature versus the target.

That brings us to the last hypothesis, which is that it isn't really a matter of whether the feature is, quote, "there" or not; we need some softer notion of how available the feature is to the model when it begins training. [00:29:10] Stepping back and thinking about what we're dealing with in these language models, it's useful to draw a parallel to classical machine learning, before everything was end-to-end. In a classical machine learning setting there are two pieces of the model: features and weights. You would [00:29:33] spend time on feature engineering, extracting good features that you think are relevant to the task you care about, and then the machine learning part would learn a weight vector that decides how much to care about each of the features you provided. [00:29:51] What happens in the end-to-end setting, like these neural language models or any neural net, is that the model is simultaneously updating weights and updating features. The model is trying to learn how to do something like language inference, [00:30:08] and it comes in with some representation of language to start with, so there's a decision it has to make. Sorry for anthropomorphizing models, I can't help it; I'm not implying they're conscious, I just anthropomorphize everything. It has to decide between learning better features, a better way of representing the language, or just learning better weights over the features it already has. It can do both of these things in parallel, but they are conceptually different kinds of learning. [00:30:42] You can imagine that if the model gets as input a feature representation that is a very bad match for the task it's trying to do, it might be better off spending a lot of effort learning new features, so that it can then learn simpler weights over those features; whereas if it gets representations that are already nicely aligned with the task structure, it can just focus on learning a good set of weights. And these might involve different amounts of work for the model: learning better features versus learning better weights. [00:31:05]

So, to formalize this notion of how hard the features are to extract from the input representation: visually, it makes sense to think about how separable the different classes, or the different features we're interested in, are. [00:31:36] There's a really nice paper on this from last year,
and they defined a nice notion of minimum description length (MDL) for thinking about this. Minimum description length has been around for a while; it's an information-theoretic notion, and the intuition is that [00:32:07] the MDL describes how efficiently you could communicate the labels given the data. It's a code length, again from information theory, where a shorter code length means you can transmit those labels more efficiently, meaning the labels are better encoded given the data. There's a formal definition on the other side of the slide; [00:32:29] the important thing to know is that the harder the features are to extract, the higher the MDL, and the easier they are to extract, the lower the MDL. In terms of how this is actually computed in practice, and there's a proof of this in the paper, it is basically the area under the loss curve as a model is trying to learn to separate, in this case, blue from red from purple. So you can think of it that way if that makes more sense: the harder it is to learn, the more slowly that loss curve decays, and the higher the area under it, hence the higher the MDL. [00:33:04]

With this notion, if we go back to the earlier picture, we saw that features responded very differently to data augmentation. The "contains 1" feature basically [00:33:21] caved instantly: as soon as we gave it 0.1 percent of training examples nudging it in one direction, it instantly solved the task. If we compute the MDLs of the spurious and target features, they're measured in bits, so the [00:33:40] absolute numbers matter less than the relative comparison, you can see they are both low and about the same for the spurious and the target feature. Whereas if we look at something like the first-last-match feature, which was very stubborn, where the model really struggled to learn even with a generous training distribution, there is a huge difference between how easy the spurious feature is to extract given the input data and how easy the target feature is; specifically, the spurious feature is much easier to extract than the target.

Sorry, the year on the slide is wrong; this is actually at ICLR coming up this year. We wanted to test the hypothesis that whether a fine-tuned model uses the target feature is a function of how hard that feature is to extract, relative to whatever competing spurious features also exist in the training data, together with the data distribution, that is, the co-occurrence between the target and the spurious feature during training. We refer to that co-occurrence as the evidence: the lower the co-occurrence between the two features, the more evidence the model has to choose one over the other. If they co-occur perfectly, the model has no evidence on which to base its decision, so you might expect it to be at chance between the two; in fact it's not at chance, but rather it depends on the extractability of the two features. [00:35:03] So again, the training evidence, where I'm using that phrase to refer to the co-occurrence between the two features, is basically the size of this purple region.
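The "area under the loss curve" intuition corresponds to an online-code estimate of MDL: train a probe on growing prefixes of the data and charge, in bits, the loss it pays on each next block before seeing it. Below is a rough sketch of that estimate using a simple scikit-learn probe; the block schedule, the uniform-code fallback, and the toy data are my own simplifications rather than the exact procedure from the paper. An easy-to-extract feature drives the probe's loss down quickly, so it accumulates fewer bits and gets a lower MDL.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def online_code_mdl(features, labels, num_blocks=8):
    """Rough online-code sketch of minimum description length: fit a probe on a
    growing prefix of the data and charge, in bits, the cross-entropy of its
    predictions on each next block before seeing it. Lower MDL means the
    labels (the feature of interest) are easier to extract from `features`."""
    n = len(labels)
    num_classes = len(set(labels))
    # Block boundaries that grow roughly geometrically through the data.
    edges = np.unique(np.geomspace(max(num_classes + 1, 8), n, num_blocks).astype(int))
    total_bits = edges[0] * np.log2(num_classes)        # first block: uniform code
    for start, end in zip(edges[:-1], edges[1:]):
        if len(set(labels[:start])) < num_classes:       # prefix missing a class:
            total_bits += (end - start) * np.log2(num_classes)   # fall back to uniform code
            continue
        probe = LogisticRegression(max_iter=1000).fit(features[:start], labels[:start])
        probs = probe.predict_proba(features[start:end])
        class_index = {c: i for i, c in enumerate(probe.classes_)}
        idx = [class_index[y] for y in labels[start:end]]
        total_bits += -np.sum(np.log2(probs[np.arange(end - start), idx] + 1e-12))
    return total_bits

# Toy usage: an easy-to-extract label (a thresholded coordinate of the input).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))
y = (X[:, 0] > 0).astype(int)
print(round(online_code_mdl(X, y), 1), "bits")
```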
And when we talk about a feature being "used": again, we're not making a causal argument just yet, that's future work. We're just going to look at performance on held-out test sets that decouple the spurious and target features. We report the F1 score over these test sets, and we treat that as a measure of whether the target feature is being used: if the target feature is being used, the F score should be high. [00:35:41] Finally, when we talk about the difficulty of extracting the features, this is a relative measure: it's the MDL of the spurious feature divided by the MDL of the target feature. The higher this is, or sorry, the lower this is, the easier the spurious feature is to extract relative to the target. [00:36:03] I always get this backwards, so: if the spurious MDL is low and the target MDL is high, this ratio will be low, meaning the spurious feature is easier to extract; as the ratio gets higher, the target becomes easier to extract relative to the spurious.

Great, so here's the experimental setup. We want to move out of the toy setting so that we can test real language models, like BERT and the other models I was talking about at the beginning of the talk, so we designed a sentence acceptability task using English sentences. [00:36:37] But we designed simple template sentences so that we have control over the spurious and target features. The sentence acceptability task just asks the model to predict yes or no: is this a good, grammatical English sentence? So in this case, "the piano teachers see the handymen," that's a fine sentence, so the label is yes; [00:36:57] if you change it to something like "the piano teachers sees the handymen," that's no longer grammatical, so the answer would be no. The target feature we want the model to use here is something like subject-verb agreement: the fact that "teachers" is the subject of "see," so "see" should agree in number with "teachers" and not with "lawyer" or any of the other nouns that appear in the sentence. That's the target feature we would like the models to use when making these acceptability decisions. [00:37:16] But then we manipulate the data so that there are many other spurious features present during training that the model might use instead. In one setting, for example, we just append a random word like "often" to all of the grammatical sentences, so a model could assume the task is basically: if the sentence contains the word "often," it's grammatical, otherwise it's not. That's not the feature we want it to use, and it wouldn't generalize outside of training, but based on the training data it might be a strong signal. [00:37:58] We also played with other spurious features: sentence length, maybe grammatical sentences are longer in the training data; maybe grammatical sentences just happen to contain more plural nouns; maybe grammatical sentences have the property that the verb happens to agree with the closest noun rather than with its subject; things like that. [00:38:20] We came up with 20 target-spurious feature pairs. They're all listed here; the details aren't important. The important thing is that they capture a range of values of the ratio we're interested in, the relative extractability. For example, here are three of them, the subject-verb agreement ones; we also look at things like negative polarity item licensing and filler-gap dependencies, and other syntactic and semantic phenomena that affect acceptability. As you can see, these range from very low ratios, meaning the spurious feature is much, much easier to extract than the target, to very high ratios, meaning the target feature is actually much easier to extract than the spurious one, given the input.

[00:39:12] For the results, we compute, like I said, the average F score over these held-out test sets. We train on a training set with some level of co-occurrence, usually a high co-occurrence, though we manipulate it during training, and at test time we look at the [00:39:33] average over test sets drawn from the deliberately decoupled regions of the space, the ones I showed at the top. And you can see a very strong correlation here between the relative extractability of the features and how well the model performs on average, aggregated across a lot of different training ratios. [00:40:02] Each point is a feature pair, and we fine-tune starting from a lot of different pretrained representations: for BERT, RoBERTa, and T5, really all of them, you can see this clear trend where the easier the target is to extract relative to the spurious feature, the better the overall performance; and bad performance happens when the spurious feature is much easier to extract than the target, which is when the model learns a solution that doesn't generalize. [00:40:24] We see a similar trend with GloVe, but it doesn't perform as well. GloVe is not a contextualized word embedding model but a static one, which means it's less powerful, and it actually fails to solve the task in many cases, so I could talk more about that separately. [00:40:46]

You get an even more complete view if you break this down by looking at the use of the spurious feature on the y-axis, as a function of the co-occurrence rate between the target and spurious features during training, at different levels of the extractability ratio. For example, look at this red bar: this is a case where the target is very easy to extract relative to the spurious feature. The y-axis is error rate, so we want low values, and here the model performs perfectly even when it's given training data that provides absolutely no evidence for doing the right thing. We have pathologically designed training data where a spurious feature perfectly predicts the label, and yet the model learns a solution that uses only the target feature and ignores the spurious one, and it performs quite well. [00:41:41] Whereas over here we see that this doesn't happen. These are results from BERT, and the trends are similar for other models; here, where the target is very difficult to extract relative to the spurious feature, the model doesn't start learning the right thing until we've done quite a bit of data augmentation.
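For concreteness, here is an illustrative generator for templated acceptability data in this spirit: subject-verb agreement is the target feature, an appended "often" is the spurious feature, and a knob controls how strongly the two co-occur. The vocabulary and templates are invented for the example; the actual templates and the twenty feature pairs in the paper differ.

```python
import random

PLURAL_SUBJECTS = ["the piano teachers", "the lawyers", "the senators"]
SINGULAR_SUBJECTS = ["the piano teacher", "the lawyer", "the senator"]
OBJECTS = ["the handymen", "the doctor", "the pilots"]

def make_example(grammatical: bool, spurious_marker: bool):
    """One templated acceptability example. Target feature: subject-verb
    agreement ('see' with plural subjects, 'sees' with singular ones).
    Spurious feature: the word 'often' appended to the sentence."""
    plural = random.random() < 0.5
    subject = random.choice(PLURAL_SUBJECTS if plural else SINGULAR_SUBJECTS)
    verb = ("see" if plural else "sees") if grammatical else ("sees" if plural else "see")
    sentence = f"{subject} {verb} {random.choice(OBJECTS)}"
    if spurious_marker:
        sentence += " often"
    return sentence, int(grammatical)

def make_dataset(n=1000, cooccurrence=1.0):
    """With cooccurrence=1.0, 'often' appears on exactly the grammatical
    sentences, so the spurious feature perfectly predicts the label; lowering
    the value decouples the two features on a fraction of the examples."""
    data = []
    for _ in range(n):
        label = random.random() < 0.5
        marker = label if random.random() < cooccurrence else (not label)
        data.append(make_example(label, marker))
    return data

for sent, label in make_dataset(n=4, cooccurrence=1.0):
    print(label, "->", sent)
```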
[00:41:56] And in general, the in-between values track this pattern: as the ratio gets higher, the models perform well with less evidence. That was for BERT; the same holds for T5, and the paper has results for other models that look similar. So the takeaway is that it isn't just a matter of the features being "there" or not there; we actually [00:42:25] need to look at the quality of the feature representations at the starting point, when we begin fine-tuning the model. That's fairly intuitive, I think; the more exciting part is that we do have ways of measuring how hard the features are to extract, using things like this MDL metric. [00:42:46] There's also an implication here about thinking in terms of inductive biases: once we have pretrained models whose features are very readily available, we can think of the model as having an inductive bias that encourages it to use that feature, even in the absence of training evidence pushing it that way. [00:43:10]

So, ending with a few points of discussion. The main takeaway is, like I just said, thinking about pre-training as a way of endowing models with inductive biases: the pre-training process really affects which solutions the model will prefer over others. And when we're thinking about how to improve model behavior, making models more robust or less biased, we have a lot of options: as builders of ML models we can think about tweaking the training data, or we can think about [00:43:47] tweaking the way we pre-train. We might not always have control over both, and even when we do, the pre-training route seems quite appealing, because there are often cases where we don't actually have control over what we train on. I'm interested in that setting personally; I like to assume we don't get to choose our training data, and that assumption is inspired by human learning: you learn from a highly correlated world where you might not have the ability to [00:44:15] solicit examples that explicitly decouple confusable features, and yet humans are pretty good at this. We come in with a lot of inductive biases, and we're pretty good at learning generalizations despite having very little training data. [00:44:38] This debate is a long-standing one in linguistics and cognitive science, about how much innate structure humans have when we come into the world and start learning, and it has often been pitched as an argument between neural-network-style learning and more feature-based, logic-based approaches. But I think opinions have been shifting over the years, and these kinds of results give, to me, an encouraging way of thinking about pre-training as a way of getting that kind of innate structure: maybe we can pre-train models to prefer compositional solutions, or more abstract structures, and then get better generalization behavior despite still training on highly correlated data. [00:45:22]

And finally, as I mentioned at the beginning, I've been increasingly interested in grounded language and robotics, so some of my current projects,
which I'm really excited about and would love to talk with people about, ask whether we can get this structure from elsewhere: if we're trying to learn good models of language, how much of this innate structure could come from non-language tasks? A lot of the things we need in language refer to, for example, notions of agents and patients taking actions in an environment, and the idea of objects persisting over time; those kinds of semantic features play a big role in the meaning representations we want to build. So maybe we can learn that structure from tasks like vision [00:46:05] and embodied interaction, and only later add in language, taking advantage of all of that structure. That might allow us to learn better generalizations and better groundings, and to get out of this text-only, kind of uncanny chatbot setting that our language models currently live in. [00:46:29] So that's a direction I think is really exciting. And that is all I have, so I'm happy to take questions; I think we have plenty of time.

>> Thanks very much for the talk. There are some questions in the Q&A section. [00:46:55] A question from the audience: did the word priors reported in the probing of contextual word representations paper come from GloVe, that is, static word embeddings, or from the contextualized model itself?

>> Good question. We don't really need the slide, but I might as well go back. [00:47:23] We did it, I think, multiple ways. Since we were testing lots of different models, we defaulted to the first layer of the contextual model in each case. For something like ELMo, that first layer was GloVe, so it was the same as the static embeddings; for BERT, they're the wordpiece tokens, [00:47:50] which are a bit different, since they're trained within BERT. So this slide is just showing the results for one model; in the paper we had results from several different models, and in each case the prior came from the first layer of the contextual model. But we did know how static word embeddings performed, because the first layer of ELMo was equivalent to them.

[00:48:16] >> There's another question: the models seem to perform about the same on the linguistic probing tasks before and after fine-tuning on downstream tasks. Does that mean that, during the fine-tuning stage, the features that drive downstream task performance aren't really these linguistic features?

>> Yes, that's my interpretation, exactly. When I see results like most of what I presented here, it seems like the model started with this really nice initialization, where it has these nice linguistic features, and then you say, okay, now you should do question answering, and it actually ignores [00:49:14] a large number of those linguistic features. It obviously isn't ignoring everything it got from pre-training, because we see significantly better results when we start from the pretrained model than if we trained from scratch. But the features it ends up using are not the full set available to it; instead, [00:49:30] what we tend to see, especially in things like SQuAD and
Natural Questions and natural language inference, is that the models are just using a lot of heuristics. They're learning a lot of task-specific heuristics, like in the lexical overlap example, so the work happening during fine-tuning is learning to use heuristics that lead to really good task performance, [00:49:57] but they're not doing the thing I was hoping they would do, which is figuring out how to use the great features they already have available. Part of my current understanding is that the reason they don't is that those features are not available enough: if we can make those features as salient, as available, as the heuristics, then the model might actually end up using the good features. But if the heuristics are easier to extract, and there are lots of them, then it's more work for the model to figure out that it should, say, use the parse. What we end up seeing is that the models solve downstream tasks in ways that lean heavily on those heuristics, which makes a lot of sense and leads to good performance, but doesn't take full advantage of the solutions they really should be able to learn given what they have. [00:50:51]

>> [A participant, Paul, introduces himself and his group.]

>> Good to see you, Paul.

>> [00:51:10] [Paul asks:] Would these types of results look any different if the probes were measured with MDL rather than accuracy? [00:51:32]

>> Yeah, so the question is whether the results would look different if we measured this with MDL rather than accuracy. I think we did end up including those results in the supplementary material, or maybe a footnote. This paper and the MDL work came out around the same time, and we reran some of the analysis using MDL, and we didn't see big differences. I would have to double-check the exact numbers, but we didn't see differences big enough that we felt they warranted updating the main body of the paper; I think that was the main takeaway. [00:52:18] I'm pretty sure it looked about the same: the features were not becoming significantly less extractable. There are huge differences, though, if you look at the average over the layers versus just the last layer. We initially ran MDL on just the last layer, and everything got way worse, and we were kind of excited, like, that explains a lot, the features have become less extractable. But [00:52:48] in fact, once you look at the average over all of the layers, which is how we ran the accuracy analysis,
there's not a big loss. What we do see consistently across all of the fine-tuning runs is that the top layers lose that linguistic information, which suggests the top layers are getting very specialized for the task they're being fine-tuned on. I think that makes sense, and it's also consistent with what the previous question was asking: the models are learning features that are [00:53:22] fairly different from these linguistic features, the top layers get overwritten to focus on those new, task-specific features, and the linguistic information gets pushed back to earlier layers and isn't really used.

>> Any more questions from the audience? I have a high-level question myself, perhaps a more speculative one. There are many different architectural choices beyond the particular Transformers used here, in how weights are shared, how many layers there are, and so on, and people have proposed many variants. [00:54:10] How would those design choices impact the conclusions or the comparisons in your work? Does the exact model architecture matter?

>> That's a good question. It's kind of hard to say. I think architectures do make a big difference. [00:54:33] My main conclusion is, I think, intuitive, and I would imagine it holds in many settings, which is basically: from the point of view of a fine-tuned model, you can either work with the features you currently have, which are right there, and learn weights over those, or you can work a little harder to [00:55:00] pull out better features and then learn weights over those. If it has to work too hard, it's not going to bother, even if that comes at a hit to accuracy; it will find some other solution. Obviously the architecture affects how easy it is to find those features. For example, in some other work we've been doing, [00:55:20] looking at learning logical representations like "and" and "or," we see a really big difference between the types of signal a Transformer can readily extract and the types of signal a recurrent model like an LSTM can readily extract. So I think the architecture would have a big effect on the behavior of the fine-tuned model, because it affects how easily the model can access different types of features. [00:55:54] And that connects nicely to inductive biases as well: we think of architecture as one of the main sources of inductive bias; an LSTM has a nice inductive bias for caring about sequential order, whereas maybe a Transformer does not have that bias. So I would think of the architecture and the pretrained representation together as defining the inductive bias of the model, [00:56:20] as opposed to just one or the other. In terms of whether the architecture of the pretrained model itself matters, I'm not sure; I imagine it does affect the representation, but I would just be speculating.

>> I'm not seeing any more questions, and we're about out of time. Thanks again for giving a very intriguing talk, and we look forward to the new directions you were talking about at the end.
[00:56:51] >> Well, yes, thank you so much, it was good to see everyone, and thank you for the questions.