[00:00:05] >> Hello, everyone. Welcome to the GT seminar. I am very honored to introduce Ankur Parikh. He is a senior research scientist at the Google New York office, and he is also an adjunct assistant professor at NYU. His research interests are in natural language processing, mostly from a machine learning perspective. [00:00:29] Recently his work has been on high-precision text generation, which he's going to talk about today. He received his Ph.D. from Carnegie Mellon's Machine Learning Department in 2015, and he received a best paper award at EMNLP 2014 as well as a best paper award at a representation learning workshop. Let's welcome Ankur, and feel free to get us started.

[00:00:59] >> OK, thanks everyone for having me. This is joint work with a number of colleagues at Google Research. I'm going to talk today about high-precision text generation. A lot of papers recently, especially ones that build large language models, show generated text that is fluent and very coherent, [00:01:23] but it is prone to hallucination, which means that the models often generate statements that are either unsupported by or contradictory to the source. This makes them unsuitable for a lot of real-world use cases. What we're going to do is study this problem, and we're going to approach it from different dimensions, [00:01:44] not just from the modeling side, but also from data and evaluation.

One of the papers that illustrates the problem is the RotoWire dataset, which was released in 2017. As you can see on the left, the source is a table of statistics from a basketball game, [00:02:04] and on the right is a summary of the game. The goal in data-to-text work is to ingest the source data and output the target summary. The authors showed that if you look on the right, which is a neural model's output, the model generates very coherent text. [00:02:22] It talks about who won the game, what happened in the first quarter, who the top scorers and the top rebounders were. But a lot of these statements are wrong or they contradict the source. This is highlighted: the blue spans are the things that are faithful to the source data, [00:02:37] and the red spans are the things that are unfaithful. You can see that the model gets scores wrong, gets who won or lost wrong, makes claims like the Rockets out-rebounding their opponent 49 to 49, and other statements like that.

And this is not restricted to that dataset. This is the WikiBio dataset, [00:02:55] where you are given a source infobox from Wikipedia — here for Frank Lino — and you need to generate a one-sentence description of the person. You can see that the model hallucinates that Frank Lino is a criminal defense attorney, which is not supported by anything in the infobox. We're going to focus a lot on data-to-text in this talk, but the problem is not limited to that setting.

[00:03:25] So one cause of hallucination is not necessarily the model — it's often the data. Oftentimes, because of the way we collect large-scale datasets, we basically heuristically pair different sources and targets. For instance, when the authors constructed WikiBio, for the source they took the infobox of, say, Frank Lino, [00:03:48] and for the target — the gold reference — they took the first sentence of that Wikipedia page.
And you can see that here: there is an entire phrase about being a member of a crime family and later becoming an informant, but that is not supported by the infobox at all. [00:04:03] So when you train a model on this data, it is naturally encouraged to produce things that are not supported by the source.

Another issue, another cause of hallucination, is evaluation. We typically try to maximize some automatic metric like BLEU or ROUGE. Unfortunately, these do not really favor precision, and they can reward outputs that are not correct. [00:04:43] Here you can see a table about Michael Douglas, and the reference says that Michael Douglas was a drummer in a particular band. Candidate one gets the highest BLEU, but candidate one is wrong: it says he was in the California jam band the Grateful Dead. It just happens to share a similar surface form with the reference, [00:05:04] so it matches a lot of n-grams and gets the highest BLEU — and model selection and model development would favor it. Candidate two is a more conservative statement, but because it is shorter it gets a lower BLEU. [00:05:28] And candidate three is not really shorter, but it states a different fact about the person than the reference does, so it also gets penalized, even though what it says is correct.

And lastly, because of teacher forcing and exposure bias, loss functions like maximum likelihood are often not well suited for stopping hallucination: at training time the model is always given the previous gold tokens — even though at test time it may have gotten one of them wrong — and is asked to do its best at predicting the next token. [00:05:46] This can make the models fall back on the unconditional language-model knowledge they have learned, rather than staying grounded in the source.

So we've been working on this for a couple of years, along three different dimensions: data, evaluation, and models. We also have some related work going beyond the supervised setting, which I won't cover in this talk. [00:06:14] I'm largely going to focus on one paper from each of these dimensions. Are there any questions before we move on to the actual content? OK.

So the first thing I'll talk about is the dataset that we've built, which is at EMNLP this year, [00:06:40] and you can find the data at this link. To give an overview: ToTTo is a dataset for a novel controlled generation task where, given a source table, table metadata like the table title and the section title, and a set of highlighted cells, the goal is to output a one-sentence description. [00:07:06] In this case, we have a table about an NFL football player named Robert, and the table describes certain statistics he accumulated over the years. We see that the year and the place are highlighted, as well as the rushing and receiving totals. We can see that the target uses information from the table and the metadata. [00:07:31] Some of the statements in the target are supported by the table but are not stated directly in it — there is some inference involved; to conclude that he finished 11th, you have to do a computation over the highlighted cells. And some of the phrases, [00:07:51] like the rushing yards and the receptions, are copied from the table along with the column headers. So that gives everyone a sense of what motivated us to build this dataset and why we think it is useful to the community for helping to analyze hallucination and build high-precision generation systems.
[00:08:12] So we talked about how a lot of hallucination is caused by data, where the target contains noisy information that cannot be supported by the source. This means it is often unclear whether a hallucination issue is caused by modeling weakness or by data noise. What we do to separate these phenomena in our work is a novel annotation procedure where annotators start with existing Wikipedia sentences — which are natural, but maybe noisy — [00:08:46] and they edit the sentences so that they are fully faithful to the table. This results in something quite different from what would happen if you asked the annotator to write a sentence from scratch. If you ask an annotator to write a sentence from scratch, they tend to write very plain sentences that don't have much interesting variety or the different linguistic phenomena you see in naturally occurring text.

[00:09:11] The other aspect of our dataset worth mentioning is the task definition. We have a table, and we have a subset of cells that are highlighted, and the model is guided to talk about them. The reason we did this is that we think it makes the task more controlled and easier to evaluate. [00:09:29] A lot of generation tasks these days are some form of summarization, and with summarization there is a lot of subjectivity in what the correct output is. For instance, in the RotoWire dataset, if the output is a paragraph-length summary, there are many different facts or phrases from the table that might be relevant depending on what you are interested in. [00:09:52] On the other hand, there are other datasets like WebNLG that focus on just verbalizing a full meaning representation — like the output of a semantic parser, something like that — but it is often quite easy for neural networks to write fluent text in that setting, so it may not be sufficiently challenging.

[00:10:16] So our dataset is novel in both of these respects. In terms of task design, we have a set of highlighted cells which give guidance on what to generate, but, as we saw from the example, the model still needs to reason about the surrounding cells and the metadata to produce the target. In terms of the annotation process, the annotators iteratively revise a natural sentence from the web so that it is faithful to the table, instead of writing something from scratch. [00:10:42] The dataset has about 120,000 training examples, 7,500 dev examples, and 7,500 test examples.

I'll briefly describe how we collected the data. We basically gathered a large set of Wikipedia tables — they cover different topics, and the tables have different formats and structures — and then we used heuristics like word overlap or hyperlinks to find sentences that may be related to each table. [00:11:13] For instance, this table is about Gabriele Becker, a sprinter, and we found a sentence saying, roughly, that after winning the German under-23 100 m title she competed at the 1995 World Championships. We can see that the sentence is related to the table, but there are many phrases that are just not supported: [00:11:38] the table does not say anything about her winning the German under-23 title. It is also not a standalone sentence — as you can see in blue, it has a pronoun, which is not desirable; ideally you would replace the pronoun with a proper noun so the sentence stands on its own. So what the annotators do first is highlight the cells that support parts of the sentence.
[00:11:56] So here they highlighted the 1995 World Championships. They also highlighted the 100 metres, because that is supported individually, and the 4×100 metre relay, because that is what is mentioned in the sentence. Then they delete all the phrases in the sentence that are not supported by the table. This can result in something that is ungrammatical or not a complete sentence. [00:12:22] So in the next stage, the annotators make the sentence stand alone: they replace the pronouns with proper nouns, and sometimes make other edits so that the sentence makes sense. And then, finally, they perform a grammar-correction step. [00:12:39] There are a lot of edits made overall — we find that some of the sentences are still slightly ungrammatical after the earlier stages, even if the fix is a minor change. This gives us a dataset of about 120,000 training examples; the total size is about 136,000 examples because the data is split into train, dev, and test.

One thing that is interesting is that the tables are reasonably large: [00:13:03] the median is about 87 cells per table, which, if you had to write a one-sentence description of the whole table, would make the task very ill-defined. But because we have the highlighted cells — the median is only about 3 highlighted cells — the task becomes considerably more well-defined.

[00:13:24] We can also measure annotator agreement at each stage. After the deletion stage, the BLEU agreement between annotators is about 85, and it gracefully degrades as more revision is applied, which is expected. This is significantly higher than the BLEU agreement you would get between the original, unrevised sentences, which is in the forties. [00:14:08] We also analyzed the phenomena in the dataset. On the left you can see a fairly broad topic distribution: about half of the topics are sports and countries, and the other half is a long tail of topics, which can be very challenging for the model. Even sports, despite being the majority topic, can be challenging, because a lot of the interesting reasoning examples come from sports. We then took 100 randomly chosen examples and did a manual analysis: [00:14:32] 21 percent require reasoning, and 13 percent require comparison across rows or columns. Are there questions? I can look at the chat right now. OK.

So we ran a few baselines to see how they did on our dataset. [00:15:05] The first is BERT-to-BERT, which is a BERT encoder with a BERT-initialized decoder. Then we ran a pointer-generator, which is an LSTM sequence-to-sequence model with a copy mechanism. And then there is a content planning model for data-to-text, which is specifically designed for this kind of task. We have three metrics: [00:15:31] there are only two in the paper, BLEU and PARENT, and we also report BLEURT, which is a learned metric, on our leaderboard. We can see they all agree in terms of the relative ranking of the systems: the BERT-to-BERT model does best, followed by the pointer-generator. Then we did a human evaluation of the top-performing model, [00:15:56] and we compare it against a human oracle to see how much headroom there is. The way we compute the human oracle is that in the test set we have multiple references — there are three references per example. We pull out one reference and pretend it is a model prediction, and then the human rater is asked to compare it against the other two references, given the source table. [00:16:15] That is where we get the oracle scores.
We see that the oracle is 99.3 percent fluent and 93.6 percent faithful — it does not reach 100 percent because of annotator disagreement and things like that. The top-performing model, BERT-to-BERT, is considerably less fluent — about 10 points lower — and it is more than 15 points less faithful. [00:16:44] And this is evidence for one of the key questions we were interested in asking in this work: even when you clean the data, do you still get a lot of hallucination? We find that this is the case — even though our targets are cleaned, the models are still not completely faithful to the source.

[00:17:10] Here are some examples that are challenging for our models. The first category is rare topics: things the model hasn't seen much of in training, so it doesn't really know how to interpret the table when it sees it. This table is about the Microdrive. [00:17:26] The reference is very informative: it says roughly that the second generation of the Microdrive was announced in 2000 with increased capacity, all of which we can see from the table. There is even some nice inference there — "second generation" comes from it being the second row, and "increased capacity" requires comparing with the row above. [00:17:50] The model, on the other hand, doesn't really understand the topic, and just says something like "there are five Microdrive models."

Another interesting category of model shortcoming is that even when the model is right, it is not as informative as the reference. [00:18:11] Here, this is about a football player's season, and we can see that the model says something that is correct — in that season he had 45 receptions for 722 yards and a touchdown — which is basically just copied from the table. On the other hand, the reference is more interesting: it says that this was his final season, which you can infer because it is the last of the years in the table that he played, and that he posted personal career highs [00:18:41] in receptions and receiving yards. We think this is one of the causes of hallucination: even when the reference is completely supported by the table, it contains language that is not easily inferable, that requires some sort of reasoning or background knowledge. And that kind of thing encourages the model to hallucinate: the model doesn't understand why "second" was there or why "increased" was there, so it may just be encouraged to make things up. [00:19:14] So in summary, we would really like you to try our dataset, and we also welcome feedback or suggestions — you can reach us at this email address. Are there questions before I move on to the next part?

>> I have one question here. So when you have the dataset and you train models to do the generation, [00:19:39] how did you deal with the table? I mean, your representation of the table — does it use all of the structure?

>> So we did something very basic, which is what people usually do: we just linearize the table, with row and column delimiter tokens and markers for the highlighted cells. Theoretically this is information-preserving, but the model does not always learn to exploit it. [00:20:14] We also had a simpler version where we just take the highlighted cells, take their row and column headers, and concatenate them with the cells. This is information-lossy, but it seemed to work better, so those are the reported results. Of course, in order for the model to exactly match the reference, it often needs information beyond the highlighted cells, so there is headroom in building better table encoders.
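To make that second representation concrete, here is a minimal sketch of the lossy "highlighted cells plus headers" linearization described above, assuming a simple table layout in which the first row holds column headers and the first column holds row headers. The tag names and field layout are illustrative assumptions, not necessarily the exact tokens used in the ToTTo baselines.

```python
# Minimal sketch of the "highlighted cells + headers" linearization described in the talk.
# The tag names (<page_title>, <cell>, <col_header>, ...) are illustrative assumptions,
# not necessarily the exact tokens used in the ToTTo baselines.

def linearize_highlighted_cells(page_title, section_title, table, highlighted):
    """table: 2D list of cell strings; highlighted: list of (row, col) indices."""
    pieces = [f"<page_title> {page_title} </page_title>",
              f"<section_title> {section_title} </section_title>"]
    for (r, c) in highlighted:
        col_header = table[0][c]   # assume the first row holds column headers
        row_header = table[r][0]   # assume the first column holds row headers
        pieces.append(
            f"<cell> {table[r][c]} "
            f"<col_header> {col_header} </col_header> "
            f"<row_header> {row_header} </row_header> </cell>"
        )
    return " ".join(pieces)

# Tiny hypothetical usage example:
table = [["Year", "Receptions", "Yards"],
         ["2004", "39", "573"],
         ["2005", "45", "722"]]
print(linearize_highlighted_cells("Example Player", "Receiving statistics",
                                  table, highlighted=[(2, 0), (2, 1), (2, 2)]))
```

The resulting string would then be fed to the encoder just like ordinary text, with the special tags learned as regular vocabulary items.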
>> We have two questions from the audience. [00:20:41] The first one, from Keith: apologies, I had to miss the first few minutes of the presentation — is the model inferring the content from the table directly, or does it also get those clues given in the form of the yellow cells? That is, does the model use the whole table, or only the relevant cells?

[00:20:56] >> So this is basically part of the problem formulation, in the sense that, yes, the annotators are performing a different task than the models. The annotators get the table and a sentence, and then they do the highlighting and they revise the sentence so that it is faithful to the table. But the model's task is different: [00:21:24] the model is given only the table and the set of highlighted cells, and it has to generate the text. The reason we do this is that it makes the task harder — the annotator's task is easier, because having to generate the sentence from scratch is much harder for the model than it would be to just revise a sentence given the original one. I'm not sure if that answers the question. [00:22:05] The second part of the question is that at generation time there is no interaction, so the model really cannot check its output with an annotator. I hope that answers the question.

>> The next question: you mentioned that the input has the cells relevant to the text. Can you talk about how the table information, including the yellow highlighted part, [00:22:30] is encoded as input to, for example, the BERT-to-BERT model? And also, how could one use your data to help with the hallucination issues?

>> Yes. For the first question: when we linearize the table, we have special tags, [00:23:00] and the model has to learn what they mean — that was our first representation. The other representation, which worked better in practice — because the model currently has trouble learning the table structure — is to just take the highlighted cells with their headers. But yes, we use tags to indicate which cells are highlighted. [00:23:25] For the next question: yes, we cleaned the targets using human annotators. The human annotators do that multi-step annotation — they delete phrases that are not supported by the table, and so on — to produce clean targets. And we released the outputs of the intermediate annotation stages as well, so if there are other kinds of tasks people want to create from them, they are all available.

>> OK, we have two more questions — do you want to take them right now or save them for later, since you are just between sections? [00:24:05]

>> I see one question about whether the models are trained from scratch — are you using a pre-trained language model? Well, it depends on the model. The BERT-to-BERT model is initialized from BERT, so that one is built on top of a pre-trained model. [00:24:30] The pointer-generator and the content planning model are not — I think their embeddings are just trained on the data. And the other question — I'm not sure which one is next. The last question is: I'm new to this topic, but I wanted to ask about real-world applications of table-to-text generation. I think there are many. We see a lot of structured data in the world, right?
[00:25:02] Whether it's tables in Wikipedia, financial data, or other structured sources, there is a lot of structured data out there, and being able to reason about it and generate text from it is important. I think we focus on it in this dataset because it is a good test bed for generating precise text. [00:25:27] It could also be useful for other applications, like summarization of structured data, partly because it is easier for a human rater to evaluate: a lot of detecting hallucination involves having model outputs evaluated by humans, and the table-to-text problem makes the table, in a sense, the whole world — [00:25:58] the world is much smaller. It is not a question of whether a phrase is true of the real world; it is only whether it is true according to the table. I think that makes things easier to evaluate and study, and that's why we focus on this setting. [00:26:22]

>> Yes, thank you.

>> OK. So next I will talk about BLEURT. We can build datasets and we can build models, but if we cannot actually evaluate whether they are precise, that is not very helpful. So I'm going to talk about BLEURT, which is a learned metric for text generation, [00:26:53] and this effort was led by Thibault Sellam on our team. Metrics are a very large bottleneck to generation progress, and what you want from a metric is often different depending on the task. Here, for instance, are predictions and references from the ToTTo dataset. We see that different model shortcomings have different characteristics depending on what we are looking for. [00:27:19] Sometimes it is obvious that the prediction is wrong, like in the first example, where there is a lot of incorrect content — an entire half of the sentence doesn't match the reference — so it is very easy for any metric to catch. However, in many cases the differences are very subtle. In the second example, the model is only wrong because the two numbers, 8 and 6, are wrong, where the correct numbers are 6 and 5, and BLEU, for instance, would not really be able to penalize that. [00:27:50] Another interesting case is that even when the model is correct, sometimes its output is not as good as the reference.

So what we are arguing is that, because we have specific task-oriented relationships that we want to measure, learned metrics are very well suited for this: you can basically take a classifier — take BERT — and fine-tune it on whatever rating data you want; the scores can mean whatever you decide you want them to measure, [00:28:12] and then you learn a metric that behaves like those ratings. The problem is that this is brittle, and it requires a lot of fine-tuning data for every new dataset or task — just like when you build models or classifiers, every time you want a new one you have to go and collect a bunch of labeled data. [00:28:34] What we want to address is making the metric robust, so that we can adapt quickly out of the box. What we propose is an additional pre-training step using synthetic data, to allow us to adapt very quickly to other domains and be robust to the different phenomena you would see, like quality drift and things like that. [00:28:54] So what we propose is: you take BERT, you first pre-train it on a lot of synthetic data, and then you fine-tune it on a small amount of human rating data to get BLEURT.
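As a rough illustration of that last fine-tuning step — not the actual BLEURT training code — here is a minimal sketch of fine-tuning BERT with a regression head on (reference, candidate) pairs labeled with human ratings. The checkpoint name, the toy rating data, and the hyperparameters are placeholder assumptions.

```python
# Minimal sketch of fine-tuning a BERT regression head on human rating data,
# in the spirit of the learned-metric recipe described above (not the actual BLEURT code).
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)  # regression head

# Hypothetical rating data: (reference, candidate, human score in [0, 1]).
ratings = [
    ("He scored 22 points in the game.", "He scored 22 points.", 0.9),
    ("He scored 22 points in the game.", "He scored 28 points.", 0.2),
]

def collate(batch):
    refs, cands, scores = zip(*batch)
    enc = tokenizer(list(refs), list(cands), padding=True, truncation=True, return_tensors="pt")
    enc["labels"] = torch.tensor(scores, dtype=torch.float)  # float labels -> MSE loss
    return enc

loader = DataLoader(ratings, batch_size=2, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in loader:
    out = model(**batch)   # with num_labels=1 the model uses mean squared error
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The pre-training step described next uses the same architecture, but on large amounts of synthetic pairs with automatically computed (weak) labels instead of human ratings.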
We show that, at least when it was published, it was state of the art [00:29:09] on the WMT Metrics shared task data. So for the pre-training, our goal is to generate pairs (x, x̃) that resemble reference–prediction pairs — we want to simulate model errors that we would be likely to see: for instance, deletions, or word substitutions with related words, [00:29:35] like replacing Tuesday with Wednesday, or adding adjectives, or having repetitions like "onto the onto the" — the kinds of errors and paraphrase-like variations you typically find in model outputs. We use synthetic data to get large amounts of such pairs, and we have two techniques for doing this. [00:29:56] First, we randomly mask tokens and then employ BERT to fill in the masks: we take a sentence, randomly mask out some tokens, and fill in the masks with different phrases. Second, we use backtranslation, which is a common way of generating paraphrases: [00:30:18] we take a sentence, translate it into a different language, like French, and then translate it back to English to get a paraphrase, which can sometimes be noisy.

Now that we have all this data, we want a weak supervision signal. So we use a variety of already-computed metrics [00:30:37] in a multi-task objective: BLEU, ROUGE, BERTscore, textual entailment, backtranslation probability, and a number of other signals. You can see our results here on WMT 2017. The baselines include sentBLEU, which is a sentence-level BLEU metric, [00:31:04] along with BERTscore and other learned metrics. We can see that just fine-tuning BERT on the rating data, without doing any pre-training — literally taking BERT and fine-tuning it — is already state of the art. And then we get an additional gain in correlation with human judgments by using our full approach, which includes the pre-training.

[00:31:29] But what we are really interested in, in addition to this — and maybe the key result — is how robust this metric is. Let's say we skew the training set toward lower ratings and the test set toward higher ratings. [00:31:52] This simulates an extreme version of what can happen in practice: you have a metric, models keep improving over time, and the data the metric was trained on represents worse models than the ones it is asked to evaluate. This is shown on the left, where we have a skew factor; [00:32:10] as the skew factor increases, we are skewing the training set toward lower ratings and the test set toward higher ratings, so there is more shift in the rating distribution between train and test. We can see that without the pre-training — without the synthetic data — the model is not very robust. [00:32:31] In the different bars, the lightest grey is BLEURT with no skew, and as the bars get darker blue, it is BLEURT with higher and higher values of the skew. You can see that the performance degrades quite a bit at the highest skews, whereas BLEU, shown in green, is far more stable. When you add our synthetic pre-training step, BLEURT is considerably more robust: [00:32:59] it performs better than both BERTscore and BLEU for all the skews except the most heavily skewed ones. And I think this is just the general picture with learned metrics — they tend to focus on performing well on the data they were trained on.
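To make the perturbation idea concrete, here is a minimal sketch of creating synthetic (sentence, perturbation) pairs by masking random tokens and letting a masked language model fill them in — in the spirit of the pre-training data described above, though the real pipeline (span masking, backtranslation, and the multi-task weak signals) is more involved. The model checkpoint and masking probability are illustrative assumptions.

```python
# Minimal sketch of creating synthetic reference/perturbation pairs via BERT mask-filling,
# roughly in the spirit of the BLEURT pre-training data (not the actual pipeline).
import random
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")  # placeholder checkpoint

def perturb(sentence, mask_prob=0.15):
    """Randomly mask some tokens and let BERT fill them in, one mask at a time."""
    tokens = sentence.split()
    for i in range(len(tokens)):
        if random.random() < mask_prob:
            masked = tokens[:i] + [fill.tokenizer.mask_token] + tokens[i + 1:]
            candidates = fill(" ".join(masked))
            tokens[i] = candidates[0]["token_str"]  # take the most likely fill
    return " ".join(tokens)

reference = "The store closes on Tuesday and reopens two days later."
perturbed = perturb(reference)
# Each (reference, perturbed) pair would then be labeled with weak signals such as
# BLEU, ROUGE, BERTscore, entailment, and backtranslation likelihood for multi-task pre-training.
print(reference)
print(perturbed)
```

Because the fills are sampled from a language model, some pairs come out as near-paraphrases and others as clear corruptions, which is exactly the spread of quality the pre-trained metric needs to see.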
[00:33:25] We also experimented with WebNLG. There are basically two kinds of splits you can do. You can hold out some of the systems — train and test on different systems, [00:33:47] but on the same data points. The other way to split is on the inputs: you can have the same systems in train and test, but different examples. And basically we are measuring how quickly the metric adapts to the new data. If we look first at the leftmost point, 0 inputs, that is basically zero-shot: [00:34:11] you take the metric, whichever metric it is, and just run it on the WebNLG test data. Here we can see that, first of all, BLEU, TER, METEOR, and BERTscore don't use any rating data, so they are fixed no matter how much training data you have. [00:34:33] "BLEURT minus pre-training minus WMT" is when you just take BERT and fine-tune it on whatever WebNLG ratings are available — and in the zero-shot case there is no WebNLG data at all. That is the white bar, and it does very poorly. You can also take BLEURT with the pre-training but without the WMT rating data — that is the light blue bar. [00:34:54] It also doesn't do that well, but it is obviously much better than not having the pre-training. And then BLEURT is the full system trained on WMT — with the synthetic pre-training and the WMT rating data, but no WebNLG data — and you can see that in the zero-shot setting it actually does pretty well; it is comparable to the state of the art. [00:35:08] As you move right, you are adding more WebNLG inputs — the next point is something like 800 examples, then a couple of thousand — and you can see the learned metrics adapt easily. [00:35:33] Of course, the gap — the benefit of the pre-training or of the WMT data — shrinks as you add more in-domain ratings, and the metrics converge. Are there questions about this?

>> Actually, yes. There is a question about your masking strategy: when you mask, do you mask a single random token or a whole span, and how do you pick the word that fills the mask? [00:36:02]

>> The BERT model determines that. We are using BERT to fill the masks. We have different masking strategies: sometimes we mask out random tokens, sometimes we mask out contiguous spans. [00:36:25] And then the BERT model searches over its predictions to pick the best fills for the masked positions — the model is essentially a language model, so it knows, for example, that it should probably put a verb in a particular slot, and we take some of its top candidates.

>> OK. Another question: when you say you split by systems, what does that mean exactly? [00:36:48]

>> This is in the evaluation part. There are different models that you would like to evaluate. In this WebNLG task, different participants submitted their own systems to be evaluated — for instance, there are 9 systems from 9 different participants. So when we split on systems, that means that some of the systems are only used for training and some of them are only used for test. [00:37:13]

>> OK. I think that's all for now.

>> OK. Thanks everyone for the questions — that's very helpful for me as well. OK, so lastly I'm going to talk about — actually, how much time do I have left?

>> I think you have another 20-25 minutes.
[00:37:42] OK, there's plenty of time then. This is work that was led by Ran Tian in our group, and it is a modeling-based approach to reducing hallucination. We're going to go back to the WikiBio dataset, because there is a lot of hallucination in this data, so it is a very good test bed [00:38:04] for reducing hallucination. We already pointed out that the references are noisy, and, as we showed earlier, the baseline will say things like the occupation of a person even if there is no occupation in the source. What we would really like is a model whose output contains only things that are supported by the source, which might be more conservative: [00:38:28] here it would just say that Frank Lino, born in October 1938 in Brooklyn, is an American.

So our intuition is about what kinds of things the model should be saying. Template-like words — such as "born" or "is" — don't convey any source information, [00:38:59] and they can be generated by an unconditional language model that doesn't know anything about the source data. For these types of phrases, we don't necessarily need to be copying from the source in order to be confident that they are not hallucinated. On the other hand, content words that are faithful to the source generally require attention to the source — like the name of the person, their birth date, their nationality, and so on. [00:39:22] And hallucinated phrases are typically content words that are not closely associated with the source: things that an unconditional language model would be unlikely to generate, but that are also not in the source. This gives us the intuition for a per-token confidence score that we would like to define. [00:39:44] We want this confidence to be high if the token is template-like (shown in blue) or if it is supported by the source (shown in green), and we want the confidence to be low if it is neither template-like nor supported by the source. [00:40:03] To model whether a word is template-like, we use a base language model that is not conditioned on the source, which gives the probability of y_t given the previous tokens. To determine whether a word is supported by the source, we use a score A_t, which is an attention score — basically a function of the attention distribution. [00:40:27] If we look at the bottom left: the probability of y_t given the previous tokens and the source is typically modeled with a softmax, where nu_t is the decoder representation at time t and e(y_t) is the embedding of y_t, [00:40:49] and then there is a normalization to get the probability. This nu_t is the sum of a_t, which is the attention vector at time t, and h_t, which is the LSTM state at time t. We basically say that the attention score is high if a_t has a high norm relative to h_t: [00:41:14] when a_t is large relative to h_t, the attention score is high; if it is low, then h_t dominates, which means the token is mostly not driven by the source, and we have low attention. And then we put these together: [00:41:41] we define the confidence score at position t — it acts like an OR function — to be A_t plus (1 minus A_t) times the base language model probability. So either your attention score is high, or the token is likely under the unconditional language model.
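A minimal sketch of that per-token confidence score is below. The way the attention score is derived from the norms of a_t and h_t follows the spirit of the description above rather than necessarily the paper's exact formula, so treat the attention_score function as illustrative.

```python
# Minimal sketch of the per-token confidence score described above:
#   c_t = A_t + (1 - A_t) * P_LM(y_t | y_<t)
# where A_t is high when the attention vector a_t dominates the decoder state h_t.
# The exact form of A_t here (a norm ratio) is an illustrative assumption.
import torch

def attention_score(a_t: torch.Tensor, h_t: torch.Tensor) -> torch.Tensor:
    """High when the attention vector's norm dominates the LSTM state's norm."""
    a_norm = a_t.norm()
    h_norm = h_t.norm()
    return a_norm / (a_norm + h_norm)

def confidence_score(a_t: torch.Tensor, h_t: torch.Tensor, p_lm: float) -> torch.Tensor:
    """Soft OR: high if the token attends to the source OR the base LM finds it likely."""
    A_t = attention_score(a_t, h_t)
    return A_t + (1.0 - A_t) * p_lm

# Tiny usage example with made-up vectors:
a_t = torch.tensor([0.9, 0.8, 0.7])   # attention contribution at step t
h_t = torch.tensor([0.1, 0.1, 0.2])   # LSTM state contribution at step t
print(confidence_score(a_t, h_t, p_lm=0.05))  # close to 1: a source-supported token
```

A template word like "born" gets a high score through the language-model term even when attention to the source is weak, while an unsupported content word scores low on both terms.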
Now, one comment: [00:42:06] the confidence score depends on learned parameters, right? A_t is a function of the LSTM parameters, and so is the language model we base this on, and ideally we would like to train it jointly with the model. So the questions are: how do we jointly train this with the model, [00:42:28] and then how do we use it at inference time? It's a bit of a chicken-and-egg problem. If we knew the confidence scores before training, what we could do is noisily clean the references by dropping tokens that we know have a low confidence score — for instance, we would know that "academic, author and radio host" has a low confidence score, [00:42:49] and we would just omit those words when training. But since the confidence score depends on the model parameters, we don't know it ahead of time. So we employ a variational, EM-like alternating strategy, where we subsample confident subsequences during training [00:43:11] based on the current confidence scores at a given iteration. We subsample subsequences with high confidence to train on — so in a particular iteration, some tokens are not sampled, because maybe they have low confidence — and we get a target to train on; as the model parameters improve, the confidence scores improve; and then we go back and keep resampling.

We also have a nice test-time calibration, based on the work of Braverman et al., [00:43:43] where we use the confidence score to calibrate the output probabilities, weighting higher-confidence tokens more: the token probability is rescaled in proportion to the confidence score raised to a learned power. Here "sg" means stop-gradient — this is a one-parameter model; everything else has a stop-gradient on it, and nothing is tweaked except this single parameter, which controls how much to adjust the probability by the confidence score. [00:44:02] And we see that the model typically learns a non-zero value for it, meaning the confidence score carries real signal.

Our paper has experimental results on both WikiBio and WebNLG; I'm just going to present WikiBio here. [00:44:24] Our strategy works with many different base models: we take a pointer-generator model with and without our confidence decoding strategy, and we also take a BERT-plus-LSTM model with and without confidence decoding. Our evaluation metrics, as before, are BLEU, PARENT, and human evaluation, and we mostly focus on the human evaluation. [00:44:44] On the left here is the human evaluation of faithfulness, which is like precision. What we see is that, first, confidence decoding significantly reduces hallucination: the pointer-generator is 80.3 percent faithful, and when you add confidence decoding to it, it gets to 86.8 percent. [00:45:12] Similarly, the BERT-LSTM is a little less faithful to begin with, but you also get about a 4-point boost when you add confidence decoding. We do see that this tracks a reduction in BLEU, because BLEU essentially wants the prediction and the reference to be of similar length, for instance, [00:45:33] and confidence decoding at higher confidence levels produces shorter outputs and lower BLEU, which we are fine with. We also look at fluency — fluency is mostly similar across the methods — and on the right we measure coverage by the percentage of cells, that is, pieces of table information, that were covered.
[00:45:54] We see a minor drop in coverage for some of the confidence-decoding models — for instance 39.4 to 37.8 — though in one case the number actually moves slightly in the other direction. We can also further increase the coverage by using a length penalty jointly with the confidence decoding. [00:46:21] This raises the coverage, but it often comes at a cost, so there are trade-offs among faithfulness, coverage, and fluency.

So this is the body of work I wanted to present. [00:46:50] We presented a multi-faceted approach to tackling hallucination from three perspectives: data, evaluation, and modeling. You can see links here to our dataset and to the BLEURT metric, which we hope you'll find helpful. Thanks, and feel free to ask questions here or over email.

>> Thank you, Ankur, for the great talk. We have several questions, so I'm going to read them, starting from the most recent. [00:47:16] OK. The first one: how do you measure fluency automatically in your dataset?

>> This is done with humans. Basically fluency, coverage, and faithfulness were all human-evaluated. We showed a human rater the infobox, the reference, and the candidate prediction, [00:47:43] and we asked the rater to say whether the prediction was fluent or not, and the same for coverage and faithfulness. I mean, you could collect fluency data somewhere and fine-tune a BLEURT-like metric on that specific dimension if you wanted something automatic.

>> Next question — I think you may have partly answered this already: have you tried giving the model negative examples? [00:48:09] Basically, if the model hallucinates, you provide that back and say "this phrase is not supported."

>> I really like this question — this is something I'm very interested in. I agree that we would like to do this, and there is some work on it: [00:48:28] if you look at something like unlikelihood training, they try to do this. I think the issue is that, unlike in classification, there is not a clear positive-and-negative paradigm. In generation, that's traditionally not the case: each example is like a positive, and you are implicitly assuming everything else is negative. [00:48:52] So I think there is some interesting research, which we and others are pursuing, to figure out the optimal ways of giving this kind of feedback. It's tricky: if you give feedback on certain phrases — in this case you could just truncate and remove the phrase — [00:49:10] but sometimes the phrase is in the middle of the sentence, and removing it might affect the flow of the rest of the output. So I think it's a bit trickier than it is for classification, but yes, I like this research area very much. [00:49:26]

>> OK. Next question: how did you select the pre-training prediction signals, and do different choices affect the performance — for instance, if you use more or fewer signals?

>> So this is regarding BLEURT. Yes, we have an ablation in the paper. [00:49:44] I think the strongest single signal is BERTscore, and I think BLEU and ROUGE are the weakest, with the entailment and backtranslation probability signals somewhere in the middle. So it definitely matters what you use for
pre-training. [00:50:10] Stronger and more diverse signals obviously help more, and weaker signals like BLEU are pretty weak — it's trivial for the model to learn to imitate them.

>> OK, thank you. Another question, regarding the dataset work: what was your biggest challenge in building the dataset?

>> Definitely preparing the data, because I think that was the main contribution of the paper. [00:50:39] On the modeling side we just took things off the shelf, so that is the more conventional part of the paper. In the future we want to innovate more on the modeling side, to develop better models that do more reasoning and don't hallucinate as much, and things like that. [00:50:55] But for this paper, the main contributions are the new dataset and the findings; the baselines are just things we took off the shelf.

>> OK, I think that's almost all the questions we have. If anyone is interested, you can also contact Ankur by email, if that's OK.

>> OK.

>> And then one last question: [00:51:29] I'm curious about the BERT encoder and the LSTM decoder. My understanding is that the reason for the LSTM is that the attention score is a ratio that compares the contributions of the input and output sides. If you instead used a pre-trained transformer decoder, could you compute a comparable attention factor?

>> Yeah, so Ran and I discussed this. Basically the reason is the exact formulation of the attention score — let me go back to that slide. [00:52:07] Right now it's formulated as a kind of ratio between the attention contribution and the decoder-state contribution, and I think that makes sense when you have an LSTM decoder, because there are these two distinct components. But in BERT, or a transformer decoder, it's attention everywhere, [00:52:27] so our notion of an "attention score" becomes a little bit more confusing, which is why we don't have experiments with a transformer decoder.

>> OK. I think that's all the questions we have today. Thank you very much for your great talk. As you can see from the questions, people really liked the work you're doing, and we want to thank you again for giving this great talk. [00:53:00] I think that's all for today.

>> OK, thanks everyone for having me and for listening.

>> OK, great. For further questions, feel free to email Ankur. That's all for today. Thank you, and have a great day.