[00:00:05] >> It's my pleasure to introduce our first speaker of the year. He comes to us from the University of Washington, where he worked on language understanding — on systems that try to read every document and actually understand them — and on question answering. [00:00:42] He was also one of the early members of the Allen Institute for AI, the nonprofit in Washington that is basically trying to solve AI, and he helped get some of their early projects started. With that, over to him.

>> Thanks Mark for inviting me, and thanks everyone for attending. I have a whole bunch of things to cover — you can see the title is pretty long — but there is an underlying theme that brings all this stuff together, which is what I want to talk to you about today. Can people in the back hear me? Ok. [00:01:29] What I want to talk to you about today is three application areas: one is question answering, another is trying to model knowledge about events, and a tiny bit about machine translation. The kinds of ideas I want to talk about here are forays into how we can try to reuse neural models, how we can try to decompose them and put them back together in some ways to use them in some target application, and maybe how we can try to control what these neural models actually do. Those are the themes I want to touch. [00:02:05] I've tried to make the talk largely self-contained for a general audience, so feel free to interrupt me if there is stuff that needs to be explained; I'm happy to take questions in the middle. There's a whole bunch of content, and I'll try to get through as much as I can. Ok, so let's start with multi-hop QA. [00:02:25] I'll start with a definition by example. Multi-hop QA is, I think, somewhat hard to define, like many other things in language, so the way I would define it is through this example. We have a question here, "Which city was Facebook launched in?", and you are given as input some piece of text which contains the two sentences shown here: one conveys that Facebook was launched at Harvard University, and the other says that Harvard University is actually in Cambridge. You're supposed to combine these two bits of fact to arrive at the answer, which is Cambridge. So multi-hop, in this loose sense, means you need to combine more than one piece of information that is distinctly stated in the text to get the answer. [00:03:08] In this space I want to talk about three pieces of work the lab has been doing. The first is: how can we use models that have been trained to do this task called entailment — how can we take a model that's trained for entailment and use it for QA? What's the connection, and how can we do this? This is where reuse comes in. [00:03:34] The other bit I want to talk about is how we can take a trained model — one that's trained for doing question answering — and get it to run on a resource-constrained device like a mobile device. Can we get these models to run faster? As they stand, they are really slow; how can you make them faster? This is where I'm going to talk a little bit about decomposing a neural model so that it can run faster. [00:03:59] And the last one is something that's really aspirational at this point — we haven't really made a lot of progress — but I think it's an important problem in general for NLP.
We build models: we have lots of datasets, as much of NLP does, we build these models, and we get good performance on the datasets. But you really can't be sure that the models are in fact solving the task that you defined the datasets for. So in this piece I'm going to tell you a little bit about the difficulties of having multi-hop datasets, and of building models that actually try to solve this — models that are actually not multi-hop, in the sense that they don't really combine multiple bits of information — and what we can do about it. So let's get started with the first bit. This is joint work with AI2 — I did a stint there and still keep my connections to them — led by Harsh Trivedi, my student, along with Heeyoung Kwon, and folks at AI2, Ashish Sabharwal and Tushar Khot. [00:04:57] So there are these two areas: one is called textual entailment, and the other is QA, and this work tries to bridge these two things. Let me again try to define through an example. Here we're trying to define what entailment means in a logical sense. You're given two pieces of information, "Zuck studied at Harvard" and "Zuck lived in Cambridge", and the question is: does knowing that Zuck studied at Harvard necessarily, logically, entail that Zuck lived in Cambridge? The answer to this is probably not, right, in the strict logical sense — he could have been a commuter. But you can frame a softer version of this problem, often referred to as natural language inference, where you want to say whether the information in one sentence gives you some probable reason to believe the information contained in the other sentence — a softer version of the problem. [00:05:51] This problem has typically been studied at a sentence level: I give you one sentence, which we will call the premise, and another sentence, which is called the hypothesis, and you want to say whether the information in the premise supports the information contained in the hypothesis. [00:06:08] People have developed a bunch of datasets for this. There were the original RTE corpora, which launched this problem, and then we had a recent large-scale dataset called SNLI, from Sam Bowman and others at Stanford. This is a really large-scale dataset with close to a million pairs of sentences, labeled with whether the information in one sentence supports the other, contradicts it, or is neutral to it. So we have large datasets, and what happens when you have large datasets? You get a large number of models — people throw their favorite model at the problem, basically all neural models today — there's a leaderboard, and we have a bunch of choices. [00:06:53] Now let's look at multi-hop QA again. This is the definition by example we saw before, and there's again a bunch of datasets for this problem, so there are some choices we have here, and again in the model space there are a lot of choices. While there have been some influences in terms of architecture — what we learned from entailment has been somewhat applied to QA, and vice versa — there really hasn't been an avenue where you say, ok, let me take a model that's trained for entailment and see if I can use it for QA. So we basically started with this question: how can we use NLI datasets, and the models that have been built on them, for QA? What does it take to do this? And of course the rationale is that some of the reasoning needed for QA has, presumably, already gone into an entailment model.
[00:07:37] Can we somehow push these models to work for QA? If you're pre-training on a related task, then that should be helpful for the target task that you're interested in, so this seems like a reasonable thing to do. The question then is: how do you frame a question answering problem using entailment? Basically, assume that we are given a question and some candidate answer, and we want to check whether this candidate answer is the correct answer for the question, given some textual knowledge. You want to use the text that you have to verify whether the candidate answer is the correct answer, and you can already see how this can be framed as an entailment problem. What we want to do is build a model that does this job by taking a pre-trained entailment model and somehow using it to build out the QA model. [00:08:18] So here's a simple example of how to do this. I have a question — basically the same one, "Where was Facebook launched?" — and I have three candidate answer choices. The first thing I do is create an answer hypothesis. Ok, let's take Cambridge as the candidate answer: we want to build a system which would verify whether Cambridge is the correct answer, and the way it's going to do this is by first translating that answer into a hypothesis — a hypothesis which is basically "Facebook was launched in Cambridge". This is the thing that we want to verify given some textual knowledge. Let's assume that the textual knowledge is just one sentence, this premise: "Facebook was founded in 2004 in Cambridge". So the multi-hop question now becomes an entailment problem, where you're trying to verify whether the information in the answer hypothesis is entailed by the information in your knowledge, which is this particular sentence. This is a straightforward framing of QA using entailment. However, a more realistic example looks like this: we don't just have one sentence, we usually have a collection of sentences, and we want to say whether the information in this collection of sentences, together, somehow entails the answer hypothesis that we have. This is the problem that we're trying to solve. [00:09:33] As I said before, the entailment datasets are sentence-to-sentence datasets, and therefore the models we have learned on them can take a pair of sentences and tell you whether one entails the other. If you want to directly apply these models to this particular framing, there are a couple of straightforward ways of doing it. One is, you could take each sentence you have in your textual knowledge, individually check whether that sentence entails the answer hypothesis — make independent decisions — and then somehow combine these independent decisions together and try to answer the question. The other thing you can do is concatenate all the sentences you have into a single piece of text, and then ask whether this premise, which is basically the concatenation of the sentences, somehow entails the hypothesis.
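As a minimal sketch of these two direct strategies — assuming a hypothetical sentence-pair scorer entail(premise, hypothesis) that returns the probability that the premise entails the hypothesis, and a placeholder hypothesis generator — something like:

```python
from typing import Callable, List

# Hypothetical sentence-pair entailment scorer: P(premise entails hypothesis).
EntailFn = Callable[[str, str], float]

def make_hypothesis(question: str, answer: str) -> str:
    """Crude stand-in: a real system would rewrite the question + candidate
    answer into a declarative answer hypothesis, e.g. 'Which city was Facebook
    launched in?' + 'Cambridge' -> 'Facebook was launched in Cambridge.'
    Here we just append the answer to the trimmed question."""
    return f"{question.rstrip('?')} {answer}"

def score_independent(sentences: List[str], hypothesis: str,
                      entail: EntailFn) -> float:
    # Strategy 1: check each knowledge sentence on its own, then combine
    # the independent decisions (here, simply take the max).
    return max(entail(s, hypothesis) for s in sentences)

def score_concatenated(sentences: List[str], hypothesis: str,
                       entail: EntailFn) -> float:
    # Strategy 2: concatenate everything into one long premise. This
    # diverges from the sentence-pair inputs the model was trained on.
    return entail(" ".join(sentences), hypothesis)
```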
[00:10:20] As you can guess, neither of these actually works particularly well. With the independent decisions, you basically cannot aggregate information — each sentence may contain only a partial bit of information, so you may not be able to get the correct answer. And when you concatenate the text together, it's a divergence from how the entailment models were trained, which is sentence-to-sentence; if you concatenate a bunch of sentences, they are not going to work well, and they can also get distracted by [00:10:43] the additional information in the text, because only a small portion of the text is relevant to actually answering the question. Good so far? Any questions? So when we thought about this, the basic question to ask is: what does multi-hop reasoning involve? I'm going to argue it involves two things. The first is that you need to somehow identify, given the set of textual knowledge sentences you have, which sentences are really relevant for figuring out the answer to this question. And having softly identified which sentences are useful, you need to somehow combine the information in them and then check whether that combined information entails the answer hypothesis that you have. Those are the two steps you need to do, and you can see — if you squint hard — that both of these are doing some form of entailment. For the first, I basically want some capability that checks the information in one sentence against the information in the hypothesis, and for this I could use the entailment model directly. [00:11:47] The second one is where I want to compare the information in a bunch of sentences, together, with the information in the hypothesis, and this is the one that requires a little bit more work. So what we're going to do is say, ok, here's a way to reuse pre-trained entailment models for this task. We're going to first build what we call a relevance module, which is going to take the sentences in your knowledge and produce a relevance distribution over them — basically the relative importance of the sentences with respect to the question; you want to think about it like that. And then another module, which we call the aggregator module, which is going to use these relevance weights to pick up the important information from the different sentences and then make the entailment decision. As I said before, the relevance module is somewhat straightforward: I'm going to apply the entailment model individually over independent sentences. Yes, I said that was a bad thing to do a few slides back, but in this case we're only interested in providing a ranking, not directly figuring out the answer, and hopefully that ranking is still useful. [00:12:44] And the key thing is that this basically aligns with the entailment datasets, because they are sentence-to-sentence, and here we have sentence-to-sentence, so this fits.
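A sketch of the relevance module just described, reusing the same hypothetical entail scorer from above; the independent per-sentence scores are softmaxed into a distribution rather than used as final decisions:

```python
import math
from typing import Callable, List

# Same hypothetical sentence-pair scorer: P(premise entails hypothesis).
EntailFn = Callable[[str, str], float]

def relevance_weights(sentences: List[str], hypothesis: str,
                      entail: EntailFn) -> List[float]:
    """Relevance module: score each knowledge sentence independently against
    the answer hypothesis, then normalize into a distribution. Independent
    decisions were a bad idea for answering directly, but here they only
    provide a soft ranking over the sentences."""
    scores = [entail(s, hypothesis) for s in sentences]
    exps = [math.exp(x) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]
```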
But the aggregator module is a little bit tricky. The aggregator module needs to do the following things: it takes in multiple sentences as input, along with the answer hypothesis that we're trying to verify; it needs to produce a representation of each of these sentences; and once we have these representations, they need to be combined into a single representation. [00:13:16] But before we do that, we want to use the relevance weights that we get from the relevance module — that's where we softly figure out which are the important bits of information. So we take the representations, weighted by their relevance, do some kind of a join operation that combines these things together, and then, using that combined representation, we make an entailment decision. [00:13:37] That's the basic way to construct this kind of a model. Now, the center box is actually doing an entailment function, and our goal is to figure out how we can repurpose a pre-trained entailment model to implement this kind of a box. You can see there are these green and gray boxes in the figure; how can we use a pre-trained entailment function to implement these green and gray boxes? To do this, we're going to pretend that the entailment model is a loaf of bread that can be sliced — basically to indicate that the entailment models we have today are all neural models with multiple layers: they do some transformations and then they do some decision making at the end. [00:14:18] Any layered analogy would have worked, but my student actually liked bread; I said we should have done something else, but he disagreed, and he won. Ok, so basically the unsliced loaf takes a premise and a hypothesis and gives an entailment decision. Now we're going to take this bread and slice it into an upper and a lower half; the upper half is going to be the green box, and the lower half is going to be the gray box in the figure that I showed. This basically means that you can take the lower layers of the entailment model and use them to construct the representation of each sentence. [00:14:55] This is done in conjunction with the answer hypothesis, and it gives you some representation. Then we apply a join operation that combines the sentence-level representations into a single one, and then we use the upper layers of the entailment model that we've broken up, as if the combined representation came from the original entailment model. So that's our architecture: the lower layers are applied sentence-wise, and the upper layers are applied once over the combination. [00:15:24] And of course we're going to use the relevance weights to scale the sentence-wise representations so that the upper layers can consume them. But, you know, why slice the bread in only one place? We can slice it in multiple places — who knows what these neural layers are capturing; each layer is doing a different kind of stuff — so you could potentially [00:15:43] slice the entailment model in multiple places, and whatever representations you get, you can concatenate them and throw them together. So that's our full aggregator model. Given this, the overall model is as follows: we have the relevance module, which uses the pre-trained entailment model to produce sentence-wise relevance weights; then we have the aggregator module, which does this slicing thing and joins stuff together; and then we make the entailment decision. We call this model Multee; the expansion — my student's — is multi-layer aggregation of textual entailment representations. [00:16:21] Ok, so to finish the description of this architecture, I have to tell you two things: which entailment model we used, and what we do for the join operation. We didn't do anything spectacular here: we took ESIM, a model that was popular at the time.
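Here is a minimal sketch of the sliced-model idea, with hypothetical module names: lower and upper stand in for the two halves of a pre-trained, layered entailment model, and the relevance weights scale each sentence's representation before the upper half consumes the combination:

```python
import torch
import torch.nn as nn
from typing import List

class SlicedAggregator(nn.Module):
    """Sketch of the 'sliced bread' idea: run the lower half of a pre-trained
    entailment model per sentence (jointly with the hypothesis), combine the
    relevance-weighted sentence representations, and feed the result to the
    upper half as if it came from a single premise."""
    def __init__(self, lower: nn.Module, upper: nn.Module):
        super().__init__()
        self.lower, self.upper = lower, upper

    def forward(self, sentences: List[torch.Tensor],
                hypothesis: torch.Tensor,
                rel_weights: torch.Tensor) -> torch.Tensor:
        # Lower half, applied sentence-wise in conjunction with the hypothesis.
        reps = [self.lower(s, hypothesis) for s in sentences]
        # Join: scale each sentence representation by its relevance weight
        # and sum into one paragraph-level representation.
        combined = sum(w * r for w, r in zip(rel_weights, reps))
        # Upper half, applied once over the combination.
        return self.upper(combined, hypothesis)
```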
ESIM is basically a BiLSTM-based model: it takes the sentence and the hypothesis, produces encodings with BiLSTMs, and then it does this thing called cross attention — basically, every token in the sentence is compared with every token in the answer hypothesis. This gives you an attention matrix, which is then used to produce a combined representation, which is in turn used to make the entailment decision — a basic architecture for doing entailment. We sliced this model at two layers: one inside the cross-attention layer, and one at a [00:17:11] higher layer. The join operation is actually quite straightforward in the case of cross attention. The cross-attention layer outputs a matrix — an attention matrix that tells you the importance of each token in the sentence with respect to each token in the answer hypothesis. So you get a sentence-level attention matrix for each sentence, and you need to somehow combine them into a paragraph-level attention matrix. These attention matrices are weighted by the relevance weights: based on the sentence each matrix comes from, we get the relevance weight for that sentence, multiply each element of the matrix by this weight, and then do some normalization, and you get a paragraph-level attention matrix. [00:17:47] So nothing special here; it's a basic way to accomplish this. Ok. In terms of evaluation: I started by saying the direct ways of applying entailment models are suboptimal because they have weaknesses, and indeed, in practice, they don't do that well; compared to them, Multee does much better. We also evaluated it against other, larger transformer-based models. Remember, our entire model is just made from a repurposed ESIM — that's all we have — and we compared it against larger transformer models that were pre-trained on language modeling tasks and then fine-tuned on question answering datasets. For one dataset, called OpenBookQA, the model does better than the OpenAI transformer, which was the best model at that point. There was also another transformer model trained on a dataset called RACE — a large question answering dataset; again, these are much bigger models, and our ESIM-based model actually does better than, or comparable to, the ensembled version of that method. [00:19:00] A similar set of results exists for another dataset, called MultiRC. So there are basically two takeaways for me from these results: entailment datasets are actually useful for doing QA, and pre-trained entailment models can actually be helpful for QA — comparable to large models. [00:19:25] And there's really nothing in this architecture that says you can only use ESIM. What we showed is that if you have a model with some layers, you can divide it into two pieces and use it this way — so we could use BERT, for example, or any other large transformer, and build this kind of a model as well. [00:19:45] We are working on this; we don't have results yet, but hopefully we'll have something soon.
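To make the join concrete, a sketch of the relevance-weighted combination of cross-attention matrices described above, again under assumed shapes and names:

```python
import torch
from typing import List

def join_cross_attention(att_mats: List[torch.Tensor],
                         rel_weights: torch.Tensor) -> torch.Tensor:
    """Each matrix in `att_mats` has shape (sent_len_i, hyp_len): token-level
    attention between one knowledge sentence and the answer hypothesis."""
    # Scale each sentence's attention matrix by that sentence's relevance
    # weight, then stack the sentences into one paragraph-level matrix.
    weighted = [w * a for w, a in zip(rel_weights, att_mats)]
    paragraph = torch.cat(weighted, dim=0)   # (total_passage_len, hyp_len)
    # Renormalize so that each hypothesis token's attention over the
    # whole paragraph sums to one.
    return paragraph / paragraph.sum(dim=0, keepdim=True)
```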
The interesting thing with respect to BERT, or any other language model, at this point, for me, is that [00:20:22] it might be the case that whatever entailment training data you have is easier to get for sentence-level, shorter contexts. So if you have a dataset which is defined over shorter contexts and you want to apply it to another task where the context is longer, maybe this is a way to accomplish that transfer. So those are basically the two takeaways. Any questions before I move on to the next part?

>> [inaudible question]

>> We haven't thought of a good way to do that; at this point we just don't do anything different. For the lower layers, we use the same weights to produce each representation, so we are not duplicating parameters per sentence; we're just reusing the same model. [00:21:03] There is a loss there: in some cases — for example, if the sentences in a paragraph form a sequence — if you don't process them together, there is definitely some loss of information. You could think of a better model in those settings to see if that could be addressed.

>> [inaudible question about knowledge graphs]

>> I'm not sure I see the connection clearly. Are you saying that we take the text, construct a knowledge graph out of it, and then figure out how to answer based on this kind of a model? [00:21:58] So — we haven't thought about this, but in general, the challenge is that there is definitely a loss of information when you go from text to a knowledge graph. Increasingly what we're seeing are attempts at encoding the text as a whole. But when the text becomes much larger — let's say I do a Google search to get a bunch of documents, and then I have to read a whole bunch of documents and then give an answer — maybe a knowledge graph might be useful there. But I hadn't thought about this. [00:22:41] So, I want to move on to the next bit, which is basically about taking, again, a pre-trained model — one that's been trained and developed for running on commodity hardware — and getting it to run in a target environment which is different from the environment it was originally trained for. Like, let's say, what if you want to run these models on mobile devices? [00:23:06] The models at this point are really big and they take a lot of compute, so it's really hard to get them to run. Think of an end-to-end question answering system: I have a large collection of documents and a question I want answered. There are two sources of bottleneck in terms of compute — two things that contribute a lot: one is the number of paragraphs or documents that I pass to my QA model, and the other is the amount of compute my model does on each such paragraph. Both are things we want to reduce. We had some initial work that basically showed that if you want to do on-device evaluation — if you have a mobile device where you're running this end-to-end system — current models take as much as 80 to 90 seconds to answer a single question, which is really unusable. [00:23:56] You have to process, let's say, 100 to 150 paragraphs — that was the optimal size to process to get the high levels of accuracy we wanted — and if you process 100 to 150 paragraphs, it takes a minute and a half. A bad thing to do. So one thing we said is, ok, can we try to optimally decide, given a question, how many paragraphs to process? Or, even better, as we keep processing the paragraphs, can we decide to stop early, once we believe we have gotten the correct answer? We can do tricks like this, and we can actually get it down to about five seconds or so.
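One way to read the stop-early idea — a minimal sketch, assuming a hypothetical per-paragraph QA model that returns an answer with a confidence score, and paragraphs already ranked by retrieval:

```python
from typing import Callable, Iterable, Tuple

def answer_with_early_stop(paragraphs: Iterable[str], question: str,
                           qa_model: Callable[[str, str], Tuple[str, float]],
                           threshold: float = 0.9) -> str:
    """Read ranked paragraphs one at a time and stop as soon as the model is
    confident, instead of always paying for a fixed budget of 100+ paragraphs.
    `qa_model` is a hypothetical callable returning (answer, confidence)."""
    best_answer, best_conf = "", 0.0
    for paragraph in paragraphs:
        answer, conf = qa_model(paragraph, question)
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        if best_conf >= threshold:   # confident enough: stop early
            break
    return best_answer
```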
[00:24:25] And the other thing that we're currently focused on is trying to reduce the compute that you do per paragraph. Let's assume that we're now in the world where I'm given a paragraph, I'm asking a question, and I want this to run on my mobile device. [00:24:39] The current models are really huge — they may not even fit on the device — but let's assume they can fit with some tricks; still, they take a lot of time to run. So we have some ongoing work focusing on this idea of decomposing a pre-trained model so that it can actually run efficiently on a mobile device. [00:24:59] (Some copy-paste failure on this slide.) One thing we could do is model compression: I have a huge BERT, and I can compress it into a smaller model — smaller means lower amounts of compute, so hopefully it can run faster, plus memory benefits, etcetera. It turns out to be a bad tradeoff, at least as of now: at the time we looked at this, there was a high loss in accuracy when we applied it to the QA tasks. While BERT can be compressed to a smaller model that works well for the language modeling task and some of the sentence-level tasks, for QA it seems to introduce [00:25:46] a high loss in accuracy. And in many cases you have to re-do the pre-training of the compressed model: [00:26:03] if you take BERT and you want it to be mimicked by a smaller BERT, then you have to get the smaller BERT to do the same language modeling task first. So it's also not a setting where you can cheaply explore the choices: say I want to compare the tradeoff of this model size against that accuracy — there are many such models I would need to train, and there's no easy way to get at this. So one thing that we looked at is: what is the main source of compute in these models? [00:26:23] The transformer-based models look like this — a bunch of layers (they don't exactly look like this, but they have layers like this) — and at each layer we basically have to compute the representation of each token in the input by comparing it with the representations at all the other positions. So you basically have n-squared complexity, where n is the length of your input. If I denote the passage and question lengths by p and q, then I'm basically paying (p + q) squared complexity. This is saying that when I compute the representation of a passage token, I have to consider the [00:26:54] representations of the question tokens as well. So you can probably guess where this is heading.
The question to ask is: is it necessary that we do this whole-input-wide attention in all the layers of a transformer? Maybe not. To some extent, you can look at a sentence locally and say, ok, I can understand this — why do I need to look at things that are far away to build an understanding of it? So at the lower layers, maybe I could get away with local representations, and at the higher layers I could build whole-input representations. That's a way I can reduce compute. [00:27:35] We ran a small experiment to check whether this in fact holds. We took a question and the corresponding passage for that question, and computed the representations: at each layer you get some representation for each token in the passage. Now, if computing the passage representations has to necessarily depend on the question, then if I change the question completely to something else, the passage representations should change quite a bit. What we checked for was basically this trend — that there is higher variance in the upper layers than in the lower layers: if I change the question, the lower-layer passage representations don't change as much. Granted, this is sort of reading tea leaves from vector representations, but the trend is what we expected: if you change the question, the representations the higher layers produce change much more than the representations the lower layers produce. [00:28:31] So we figured that local attention might be enough at the bottom, and we did a simple decomposition. Basically, in the lower layers we no longer compute cross attention: the passage representations are computed by attending only to passage tokens, and the question representations only to question tokens; and then at the higher layers — at some layer that we decide — we do the regular, full attention. [00:28:54] Again, we're taking a pre-trained model and applying it in this way; this is a divergence from how the model was trained. The immediate thing we get is the savings in compute: you basically pay O(q squared) plus O(p squared) instead of O((p + q) squared). But the biggest saving is that the passage computation is now independent of the question, which means I can pre-process all the passages, cache their lower-layer representations, and then just look them up and feed them in at the top layers. [00:29:32] That's the biggest benefit: we can cache the passages. And even though this is a divergence, the model architecture is largely the same — there's no drastic change, compared to, say, model compression, where one example is going from 12 layers to 6 layers with fewer attention heads. Here the architecture largely remains the same, and therefore the pre-trained weights that I have are largely ok: I just take the pre-trained weights, do some fine-tuning on the target task, and I'm already in a good space — I only lose about 3 points of accuracy on the target QA task.
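A minimal sketch of the decomposition, with assumed names: layers stands in for the pre-trained transformer layers, reused without architectural change, and k is the layer at which full attention resumes:

```python
import torch
import torch.nn as nn

class DecomposedEncoder(nn.Module):
    """Sketch: layers 1..k attend only within the question or only within the
    passage; layers k+1..n attend over the full concatenated input. The
    passage side of the lower layers is question-independent, so it can be
    precomputed offline and cached."""
    def __init__(self, layers: nn.ModuleList, k: int):
        super().__init__()
        self.lower, self.upper = layers[:k], layers[k:]

    def encode_passage(self, p: torch.Tensor) -> torch.Tensor:
        # Question-independent: run once per passage and cache the result.
        for layer in self.lower:
            p = layer(p)
        return p

    def forward(self, q: torch.Tensor, cached_p: torch.Tensor) -> torch.Tensor:
        for layer in self.lower:          # O(q^2) instead of O((p+q)^2)
            q = layer(q)
        h = torch.cat([q, cached_p], dim=1)   # (batch, q_len + p_len, hidden)
        for layer in self.upper:          # full cross attention only here
            h = layer(h)
        return h
```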
But even this we can improve, using what are called knowledge distillation ideas. [00:30:18] Right, so we want the decomposed model, which is much faster, to behave exactly like the original model: what it computes at layer k plus one should look exactly like what the original model computes at layer k plus one. As long as it stays close, maybe I'm not losing a whole lot of information. So I can add some additional [00:30:41] losses to my objective — knowledge distillation losses — during fine-tuning, and that can mitigate even the small losses that we had. It's a simple way to further improve the performance. I have some results here showing what happens when we decompose BERT: we get about a 3.5 times speedup [00:31:04] in terms of runtime, and we get about a 70 percent reduction in memory usage, because you're not keeping the representations of the lower layers around anymore — you only need to hold the k-th layer's representations at runtime — so you get a large saving in memory as well. We have results with XLNet which look similar. There are also some numbers in the bottom row about sentence-pair models: if one sentence is already given to you in advance and you can pre-compute it, you get about a twofold saving. So the broader takeaway here is this: [00:31:40] the models that we use in NLP today are getting bigger and bigger, and if we want them to run on mobile devices, something needs to be done. That doesn't mean we have to give away or throw out the benefits you get from having multiple layers of transformations and the many attention heads that are there. There are probably ways to make small changes to the model — decomposition is one such idea — that still retain the model structure and still let you run efficiently. [00:32:11] So this decomposing-and-putting-things-back is useful. The other main message is that it's really important to understand the bottlenecks in compute if you're running on other devices. There are non-intuitive systems things that crop up, and some of these changes don't really work very well on mobile devices, so it's really critical to understand, from a systems standpoint, what is causing the bottleneck. [00:32:39] The other thing is, again, this notion of whether we need full, input-wide context processing for all tasks. I don't think it's a settled thing; depending on the task, we might be able to pick up local-context representations alone, and then assemble these local-context representations at the higher layers to do interesting things. [00:33:03] Any questions on this?

>> [inaudible question]

>> The cache — that's the biggest saving here. There is some one-time saving too, as I said: you lose the 2pq term from the (p + q) squared complexity — this model has q squared plus p squared; the pq part is gone — but practically that doesn't give you a lot of savings. The piece where the passage computation is replaced with a constant-time lookup — that's where the big savings are. [00:34:14]

>> [inaudible question]

>> Yeah, sorry, I have a tendency to skip these things. If you look at BERT-base and the decomposed BERT-base — the second and third columns — there is a small drop in accuracy, but it's like one point something. How are we doing on time? Ok.
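As a footnote to the distillation idea above, a minimal sketch of the auxiliary losses, under assumed names: the decomposed model's upper-layer hidden states are pushed toward the original full-attention model's during fine-tuning:

```python
import torch
import torch.nn.functional as F
from typing import List

def fine_tuning_loss(task_loss: torch.Tensor,
                     student_upper: List[torch.Tensor],
                     teacher_upper: List[torch.Tensor],
                     alpha: float = 0.5) -> torch.Tensor:
    """Add knowledge distillation terms so that what the decomposed model
    computes from layer k+1 upward stays close to what the original
    full-attention model computes there. `alpha` is a hypothetical
    mixing weight for the auxiliary terms."""
    kd = sum(F.mse_loss(s, t.detach())        # teacher side is frozen
             for s, t in zip(student_upper, teacher_upper))
    return task_loss + alpha * kd
```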
So this part, [00:35:18] as I said before, is something that I'm very excited about, but it's really mostly aspirational at this point — we don't have lots of direct results to talk about. We're again focusing on the multi-hop problem. By now you know the story: datasets come out, [00:35:36] there's a lot of progress, there's some leaderboard out there, people put out a whole bunch of models, and there's a leaderboard frenzy. But really, this does not indicate that we've made progress on the task — and this has happened in multiple areas. Here's what happens, right: we create datasets by going to crowd workers. We give people a large text and ask them to create questions that we hope are multi-hop questions. That basically gives you a good-quality dataset — good question-answer pairs — but there's an assumption baked in: that in these datasets, for each question, more than one fact is actually necessary to answer the question. This is an assumption we make about the dataset, along with the assumption that all of the facts necessary to answer the question are in the input. [00:36:15] So the datasets look like this: you have a paragraph — this is the paragraph that the system should read — and there is a question; the system should read this paragraph and the question and give you the answer. A good dataset is one where this paragraph has at least two facts that need to be combined in order to arrive at the answer. That is the hope we have. [00:36:44] The hope, then, is that when models are trained on this data, they will actually use the information in the input text and learn to do multi-hop reasoning. But this is not always realized. Why? Because models can exploit shortcuts that exist in these datasets — in terms of reasoning, and any other artifacts that exist — and basically solve the problem without actually doing multi-hop reasoning. There have been a bunch of papers on this recently. The thing you can ask is: ok, what do we do in these kinds of settings? One thing we could do is carefully study the model behavior: say you have a dataset, you've trained your favorite model, you carefully do some controlled experiments, and you try to understand better whether the model is doing multi-hop reasoning. Then you can say, ok, now that I've understood the kinds of cheating that this model does, let me go back to my dataset creation and fix it so that this kind of cheating can't happen — I create harder questions where this cheating fails. [00:37:46] The other thing we can do is think about some clever ways of evaluating these models, so that they are not just evaluated on question answering accuracy; instead, you ask them to say which bits of information they used to derive the answer — the model marks up, here's sentence one, here's sentence two, I used these two to give you the answer. But that alone also turns out not to be enough. [00:38:11] So when we looked at this, we said: ok, really, going back to the dataset and having people give us hard questions is really difficult.
Just getting questions and answers that are consistent with the passage was itself hard; ensuring that the questions require multi-hop reasoning is not something you can easily enforce while still getting data at large scale. So one thing we wanted to do is to say: ok, can we take existing datasets and somehow transform them, so that a model cannot do well on the transformed dataset unless it actually does multi-hop reasoning? That's the aspiration. We have basically two goals: one is to create such a dataset, and the other is to have some kind of a measure — you give me a dataset for multi-hop reasoning and I will tell you how much actual multi-hop reasoning [00:39:07] your dataset requires; or you give me a model and I will tell you how much multi-hop reasoning your model does, given a particular dataset. These are really hard goals, but that's what we're looking at. The basic idea behind the transformation is something like this. I give you a question, and you have these two facts that I already know should be combined to answer that particular question. If I remove one of these facts, then a model that does multi-hop reasoning — a "happy" model — should fail, or rather abstain: it should not be able to give you the correct answer, because it doesn't have the information to give you that answer. [00:39:45] A "non-happy" model, on the other hand, is one that uses some kind of a shortcut. An example of a shortcut: the question was "Which city was Facebook launched in?", and I had two pieces of information; if the model is using a shortcut, it could say, let me just look for the city in the passage — never mind Facebook or anything else. That's the kind of shortcut we want to catch. A non-happy model which uses this kind of shortcut will still return the same answer, because it was never using that first sentence in any way. [00:40:13] Ok, so this seems like a reasonable idea. We can take any question in your original multi-hop dataset and create a paired question, where we remove one of the facts and replace it with something else; now we've created another question where the answer should be [00:40:28] "I don't know". So what we can do then is create an evaluation metric, which we call paired accuracy: a model gets one point only if it answers both versions of the question correctly — for the question with the full input, it should give you the right answer, and for the question with the modified input, it should say "I don't know". Only if it does well on both does it get the point. [00:41:04] To demonstrate that this kind of thing is actually reasonable: we can take a model like BERT and run it in a single-fact mode — meaning we hide some of the inputs so it cannot combine information across them — and you can demonstrate that the BERT model, running on the original dataset [00:41:29] in single-fact mode, can actually get quite high accuracy, but on the transformed dataset it gets zero. This is the kind of behavior we want. Although I've shown an empirical point here that this is actually possible, conceptually there are lots of holes in this kind of transformation.
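A minimal sketch of the paired-accuracy metric just described, with an assumed data layout — each pair holds the original example and its transformed twin, whose gold answer is "I don't know":

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, str]   # assumed keys: "context", "question", "answer"

def paired_accuracy(pairs: List[Tuple[Example, Example]],
                    predict: Callable[[Example], str]) -> float:
    """A model scores a point only if it answers BOTH versions correctly:
    the original (full input, real answer) and the transformed one (one
    supporting fact removed, gold answer "I don't know"). A shortcut model
    that never used the removed fact keeps its old answer and loses the point."""
    points = 0
    for original, transformed in pairs:
        if (predict(original) == original["answer"]
                and predict(transformed) == transformed["answer"]):
            points += 1
    return points / len(pairs)
```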
[00:41:50] Sadly, we can't really guarantee that this kind of transformation alone will ensure that no model can do just single-fact reasoning and still get a non-zero score. In theory, it's not possible to guarantee this kind of behavior, while in practice it might actually be possible. Ok, since I have about five minutes left, I'm going to switch gears and talk a little bit about controlling models, in machine translation. [00:42:38] So: I talked about reuse, a little bit about decomposition, and this attempt at multi-hop reasoning is, at heart, about trying to get control — we want the models to behave in a particular way, not just get good accuracies on the dataset; we want them to actually do multi-hop reasoning. [00:42:57] We also did some work in machine translation where we tried to exert a little bit more control. In machine translation, you usually have an input that comes in in one language, and you want to translate it and produce a sentence in the other language. What if I wanted a more diverse set of translations? I want the translation to be controlled in some way — not just faithfully conveying the meaning, but also written in different forms, let's say. One angle we were looking at is: can we ask the decoder in the translation model to take as input a part-of-speech tag sequence, and translate the input into that part-of-speech tag sequence in the target language? So we have [00:43:40] a latent variable model that does this. The key challenge is that when you introduce latent variables in your decoder, you're basically at the mercy of inference tractability: if you have a latent variable sequence alongside the target output you're trying to generate, you have to marginalize over all possible latent sequences, and that's really expensive to do. [00:44:03] So what we did was take a simple approach of decoupling the dependence between the sequence of latent variables. We have a latent variable that corresponds to the part-of-speech tag; conditioning on it, the model generates an output token; but at the next step, the latent variable's value conditions only on what the model has generated so far, not on the previous latent variables. This decoupling means that we can actually do this kind of inference efficiently — we can do it exhaustively. What this model basically gives us is a way to explore the latent space more exhaustively, and we can decode from these latent spaces into the target language. [00:44:42] So we get better translations, we get a better exploration of the latent space, and given a particular setting of the latent values — say, a part-of-speech tag sequence — this model produces translations that adhere better to the specification we give. [00:44:59] I'll skip the rest, stop there, and take questions. Thank you.
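A hedged formalization of the decoupling just described, written for a target sequence y given source x, with per-step latent tags z_t (notation assumed, not the speaker's):

```latex
% Exact marginalization over a chained latent tag sequence is intractable:
%   p(y \mid x) = \sum_{z_{1:T}} p(y, z_{1:T} \mid x)
% With the decoupling (z_t depends only on x and the output generated so
% far, not on z_{<t}), each decoding step factorizes as
p(y_t \mid x, y_{<t}) \;=\; \sum_{z_t} p(z_t \mid x, y_{<t}) \, p(y_t \mid z_t, x, y_{<t}),
% so the sum at each step ranges over single tag values rather than over
% exponentially many tag sequences.
```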
>> [inaudible question]

>> Right, so that's why — this is one definition I initially gave up on. It's not clear to me how to define what a multi-hop question is; in some sense, by implication, it's even hard to define what a single-hop question is. The way we'd put it, a single-hop question is one where there is a single sentence which is all you need to answer the question. Datasets like SQuAD, for example, might contain a mix of these multi-hop and single-hop questions, so we don't really know the per-type performance. But if you go by, let's say, the performance on SQuAD — that's kind of a lower bound for the performance on single-hop questions, assuming that single-hop questions are the easy ones. Models get close to 85, even 90 sometimes, on these datasets. But then again, it's not clear the models are actually doing the right reasoning to arrive at the answer, or just exploiting some kind of shortcut to come up with it. [00:46:46]

>> [inaudible question about whether efficiency matters beyond mobile devices]

>> Yeah, I think so. There are two ways to think about this. One is: I have some target environments where compute is limited, and I need to figure out how to get these models to work well there. But even in the general setting, if I could do inference faster — at Google scale, if I want to serve Google search queries using these models, I'd better be fast. [00:48:06] And especially since the analyses we have now keep telling us that a lot of the computation is redundant, it's hugely wasteful. The other side effect is that you might actually get better training effectiveness, because you're now able to train on more samples in the same time. So I think there's a whole lot of benefit there; and apart from that, there are the other things — carbon footprint, etcetera — there are implications there too. [00:48:39] Another side to this is thinking about what actually helps: just trying to be efficient helps. I think this idea of throwing more compute and more resources at a problem should be explored, but I think there's also value in thinking about constrained settings, where you ask: what is the chief bit of information I need to do this? That tells you a lot about what you need for the task — for the general science, I think that's useful. [00:49:09] I'm also interested from another side: I think I'm becoming a little bit of a control freak. I want to know what the models do, I want to control what they do, I want to understand them better — and with the bigger models, it's getting harder. [00:49:46]

>> [inaudible question]

>> So, I think in this case we don't actually rewrite anything — we only slice the input, compute the parts, and put them back together. In the previous case, you know, we don't change any of the words either, but the first work is sort of like that: we've constructed something with parts from other stuff. [00:50:07] And it's remarkable — I don't know what to say about this — a few epochs of training and this model seems to settle down. I think if you stay as close to the original architecture as possible, good things seem to happen. [00:50:24] If you change it to a completely different architecture, then it's tricky. One thing that we struggled with was positional encoding: if you compute the two parts separately, the top layers somehow need to resolve the positional encoding, and there are some tricks we had to put in — nothing particularly clever, but that was difficult. So, for example, we found that if the answer was located in a particular place in the passage,
the model would always point there and say "this is the answer" — it was almost clear, just looking at where it was placing the answers, that it was always close to the exact answer but picking random tokens of a different type. That looked like, ok, there's something funky going on with the positional encoding, so we were able to debug that right away. But there are really weird phenomena like that that can happen. There was a question? [00:51:58]

>> [inaudible question]

>> So, we varied the places where we can cut the model; there's a tradeoff, I think. We picked nine because that seemed like the right layer to cut at, from an empirical standpoint. But this explainability idea is actually pretty cool, I think. [00:52:20] It's a way to try to understand model behavior in some ways, and I'm interested in explainability for its own sake, not just in application to this particular thing. We have some ongoing work — related work on relation extraction — where we're trying to predict interactions between foods and drugs and diseases from biomedical literature, but, alongside, spit out why we think that's the case. Then the question is: can you take the explanation and feed it back to the model, change the explanation somehow, and see if the model now gives you a different answer? Because if it is actually using the explanation, it has to be consistent with the decisions it makes using that explanation. [00:53:00] But that's not really related to this — there was a point I should have clarified.

>> [inaudible question]

>> So I think — let me talk about an example; maybe it relates to what you're saying. [00:54:28] Let's say I have a model that tells me — or I have a way of figuring out from the model — that it's paying attention to this piece of information here and this piece of information there. But I really don't know what it does with them after that. It could say, I looked at these two, [00:54:48] but then it only checks whether one of them goes with the answer, and it only uses that one bit of information to decide its answer. That's not really multi-hop reasoning either, because sometimes you need a specific type of reasoning over both. So my frustration is at a level where we can't even do the first bit — actually saying that this model consulted these two bits of information in some way. [00:55:11] Just looking at the attention — I could point to it as such an explanation, but attention is not explanation; there's been some back and forth on this stuff. And that was another weird thing that happened with the multi-hop stuff too: even if you scale down the information to some minuscule level, the models can still extract that information. It's very hard to do this stuff, so that's one of the frustrations. The other, for me, is that we somehow think that just by building a dataset we will have given the model the right training pressure, so that it has to do the right thing. I think that's totally not true, and we find so many weird ways in which the models can exploit these things.
[00:55:57] What I would like — so, I'm a little bit skeptical on the interpretability side, for the same reason: I don't know that I can look at the internals and figure out if it's doing the right thing. So I want to go back and say, ok, let me do a behavioral study of this model: if we change the input this way, then it has to do this particular thing, and if it doesn't, then I know it's doing something bad. [00:56:18] So I don't know if that neatly fits into the bottleneck kind of idea, but I'm stepping away from actually having to look at the internals, or probing them, as the way to say it's going to do the right thing. I want to come up with tasks and datasets where this kind of pressure — the right pressure — is being exerted, or use the models in ways that force them to actually do the right thing. [00:56:53] I would be happy — yeah, I would be happy if I can predict its behavior. My thing is: if the model gives me an explanation, and then, using that explanation, I change something about the input, then the model has to be consistent with its explanation. I think that's the only way I can guarantee that I understand, and trust, its explanation. [00:57:17] Thank you.