[00:00:05] >> Thank you for coming to the machine learning seminar. This is not our normal time, but we have a very special guest. This isn't officially an NLP seminar series, but we have a bunch of NLP speakers: David Bamman is here today, and we have more coming on the 25th and on December 2nd and 3rd, all on natural language processing. Make a note on your calendars, and you'll be seeing a bunch of advertisements about that. [00:00:41] But we're here, and I'm so pleased to bring him here. He's one of the people whose papers I just love reading, very near and dear to my heart. He's a professor from the Berkeley School of Information, and he works at the intersection of natural language processing and what I think of as the digital humanities: he looks a lot at literature, and literature is an interesting natural language processing problem. He'll tell you all about it, because unlike, say, Wikipedia, which is much more fact-based, there's a lot of complexity in how people write when they write stories, and, as with social media, they don't stick to the niceties of natural language. David is going to tell us all about how he's been trying to tame the beast of literature. So, great, thanks. [00:01:26] >> Ok, so thank you, thank you Mark, and thank you all for being here. Am I miked up loud enough for everybody to hear me? Fine, ok, good. How many of you here are either working in or have an interest in NLP? Wow, I'm glad to hear that, this is awesome. Ok, great. What I want to talk about today is another application area of NLP, one maybe a little bit different from the kinds of applications that you're either working on in your research or studying in class, very different from question answering, from summarization, from machine translation. What I want to talk about today is the use of NLP
essentially as an algorithmic measuring device, as a way of giving us some quantified measurement of a phenomenon that can shed light on questions of literary history or literary theory. In this particular case I'm using the example of literature to see how we can use these methods to shed light on this fundamental problem of understanding literature, but it's really a kind of application that has much broader appeal to lots of other areas as well, including not just the digital humanities but computational social science, computational journalism, anywhere we want to see some signal in text and extract it using these algorithmic measuring devices. This work is in collaboration with Matt Sims, Ted Underwood, Sabrina Lee, Jong Ho Park, Sejal Popat, and Sheng Shen, and I'll point out that it really couldn't have happened without collaborators who come from literary departments; that includes my postdoc Matt Sims, and also Ted Underwood and Sabrina Lee, who are both in English at the University of Illinois. Ok, so let's jump in. Now, how many of you have read a book in the past month? [00:03:04] Wow, this is surprising, that's so many of you. When I was in grad school, and there's no shame in this, when I was in grad school I focused on NLP for literature, and it was very rare for me to have any time to read an actual book. [00:03:19] For those of you who haven't read a book in a while, this is what a book looks like.
This is Moby-Dick, and there are a lot of ways in which the kind of language we see expressed in books, in novels, in literature in general, is very different from the kinds of domains we typically study: very different from news, very different from product reviews, very different from tweets. One thing that distinguishes literature is its length. Books on average are about 200,000 words long, and that complicates a lot of the standard algorithms we use in NLP for things like coreference resolution. Coreference is essentially quadratic in its complexity in the number of entity mentions you want to resolve: at any given point you want to compare a mention with all of the previous mentions you've seen, to see which one to link to as the most similar. Books, and literature generally, also tend to have very long sentences. The average sentence length in the Wall Street Journal is something like 20 words; the average sentence length in literature is about 40, and it's not evenly distributed. [00:04:21] The reason this matters is parsing: constituency parsing has a computational complexity that's cubic in the length of the sentence. For an average sentence length of 20 words that's fine, but when we have sentences that stretch those boundaries and get to the maximum lengths we see in literature, like 100 words or 1,000 words, that complexity really blows up, effectively making these kinds of methods unsuitable for this task. Literature is also complicated by its use of figurative language: it uses metaphor, and depictions of imagined spaces, in ways that are very different from the kind of neo-Davidsonian semantics we have for expressing propositional structure and content in sentences. So literature is very different in many ways from these kinds of domains that we typically study.
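To make that scaling concrete, here is a back-of-the-envelope sketch (my own illustration, not part of the talk) comparing the relative work done by O(n³) constituency parsing and O(m²) mention-pair coreference at news-scale versus literary-scale inputs:

```python
# Relative cost of core NLP algorithms as input size grows.

def cky_cost(n):
    # CKY-style constituency parsing does O(n^3) work for an n-word sentence
    return n ** 3

def coref_pairs(m):
    # mention-pair coreference compares each new mention against every
    # previous one: m * (m - 1) / 2 comparisons for m mentions total
    return m * (m - 1) // 2

# Sentence lengths: a WSJ-like 20-word sentence vs. long literary sentences
print(cky_cost(100) // cky_cost(20))    # 125x the parsing work
print(cky_cost(1000) // cky_cost(20))   # 125,000x

# Mentions: a short news article vs. a novel-scale mention count
print(coref_pairs(200))     # 19,900 comparisons
print(coref_pairs(20000))   # ~200 million comparisons
```

The constants are invented; the point is only how quickly cubic and quadratic costs grow at book length.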
[00:05:08] Literature has also been used in a couple of different ways in NLP. One way literature is used is as a source of data. If any of you here have used BERT: BERT has been amazingly successful for a lot of tasks, responsible for raising the state of the art by five points on average across a number of different tasks in NLP. BERT is trained on Wikipedia, but it's also trained on the BooksCorpus, which is about 7,000 or 8,000 self-published novels, so it's a really large source of data that these contextual language models are learning a lot of information from about the state of the real world. [00:05:44] Another way we see fiction being used is in NarrativeQA, for the task of question answering, trying to answer questions whose scope extends over a much larger context than just a single Wikipedia article or a single news article. We also see fiction being used in commonsense stories, in work by Mostafazadeh and others: the creation of a dataset of five-sentence stories where one task is trying to predict how the story ends. And fiction is the fundamental data that drives a lot of work in semantic and syntactic change, using the Google Books dataset in particular to see how words have been used historically over 200 years and how those words have been changing in their senses.
[00:06:31] At the same time, there's a lot of work in modeling specifically literary phenomena. That includes my own work in trying to model character types in literature and in movies; it includes work by Mohit Iyyer and others looking at the relationships between characters in novels; it includes trying to have models that capture sentiment, or plot, some representation of plot that can be either as simple as sentiment or something more complex; and it even gets into questions of character psychology: how it is that characters have specific motivations for the kinds of actions we observe them performing in these kinds of data. [00:07:08] So these are two different dimensions in which we see fiction being used in NLP right now, where these kinds of papers are published at venues like ACL, EMNLP, and NAACL. There is, however, an entirely different sphere of research where this kind of work on computational methods for reasoning about literature is being driven, and that's work in the computational humanities. Here is a snapshot of some of my favorite work in this space, just to call out two examples of what this work looks like from a literary perspective. Ted Underwood has great work in measuring how much time elapses in the span of a given piece of text: within a given paragraph, is there one minute that elapses, ten minutes, an hour, a year, a century? Can we go about predicting, from one piece of text, how much time elapses in it? Holst Katsma also has great work on measuring how loud a novel is, where loudness is very much tied to the specific verbs of speaking that introduce quotations. You go about establishing a measure for every individual verb that can introduce a quotation: 'said' might have a score of 5, not being particularly loud or perfectly quiet, while 'shouted' has a very loud score. So if you measure
what kinds of verbs are being used to introduce quotations, you can get a sense of how loud the audio environment captured in the context of a given novel is. So there's a lot of work here that's being driven by fundamentally literary questions, using the methodologies that you all have been developing in this space of NLP to really drive insight into the fundamental problems of literature. What I'll do at the very beginning here is give you one case study of how we can use NLP to drive literary insight about a specifically literary phenomenon, and then talk about ways we can develop more complex NLP models that can capture more complex questions we might want to ask and drive different forms of insight. The question I want to get at here is the fundamental difference between the depiction of men and women in fiction, between what male characters do and what female characters do, over 150 years of literary history. [00:09:23] Now, this fits into a lot of other work on gender bias right now, where we know that gender bias really creeps up every time we have a system trained on data, and that it impacts other kinds of questions, like the representation of different genders in different data sources. We know in particular that only 15 percent of Wikipedia biographies are of women, so even just the quantified nature of how many biographies are of women is very imbalanced in that particular dataset. We know that even looking within those biographies, women tend to have much more emphasis given to the acts and events of marriage and divorce than men do. We know that bias creeps up [00:10:06] implicitly in any kind of model that's trained on natural language; word embeddings famously incorporate this kind of bias.
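As a toy illustration of how that kind of embedding bias can be quantified, here is a small sketch in the spirit of association tests like WEAT; the two-dimensional "embeddings" below are hand-made for illustration only and not taken from any real model:

```python
# Toy measurement of gender association in word embeddings.
# The vectors are invented 2-d examples, not real embeddings.
import numpy as np

def cos(a, b):
    # cosine similarity between two vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

emb = {
    "he":       np.array([ 1.0, 0.1]),
    "she":      np.array([-1.0, 0.1]),
    "engineer": np.array([ 0.8, 0.6]),
    "nurse":    np.array([-0.7, 0.7]),
}

def gender_association(word):
    # positive -> the word sits closer to "he"; negative -> closer to "she"
    return cos(emb[word], emb["he"]) - cos(emb[word], emb["she"])

print(gender_association("engineer"))  # > 0 in these toy vectors
print(gender_association("nurse"))     # < 0 in these toy vectors
```

Real bias measurements average such associations over sets of attribute and target words, but the basic arithmetic is the same.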
That includes the kind of cultural bias that's implicit in the natural language we use, where we're incorporating the bias that exists in reality into our depiction of it, in the words we use to describe it. And the bias that creeps into word embeddings also propagates to a lot of other downstream NLP tasks, including coreference resolution, sentiment analysis, even speech recognition. So biases prevail in these kinds of datasets. [00:10:40] The question of whether or not there is this same sort of uneven representation of gender in literature is fundamentally a problem of measurement, and what I want to do here is really try to explore whether we can design an algorithm, an instrument, that can transform a single novel into a quantity that expresses its depiction of gender. This, at a big-picture level, is what I think a lot of this kind of classic methodology really captures: trying to take an entire novel, an entire long document, and reduce it to a single number, to take all of Tom Sawyer and reduce it to the single number 0.53. Now, this may seem very reductionist, but in many cases this kind of reduction
[00:11:23] is what we need to do to test specific hypotheses, to be able to articulate a specific research question in a way that can be answered in this context. And part of the art of designing these systems is how we go about incorporating the knowledge that we have about these phenomena, and about these specific datasets and domains, so that this number is informed in an intelligent way. The way we're going to get at this number for a text is, again, to reason about the core objects we want to make claims about, and those are characters: we want to see whether or not there is an uneven distribution in the attention given to male and female characters over the course of these novels. This depends on characters, so what we need to do is use methods being developed in NLP to reason about what a character is, to identify and define a character, and then have some measure of what we count as attention. Attention we could measure in lots of different ways; we're going to measure it here as the amount of screen time those characters get in a novel, and we're going to define screen time as the number of things those characters do and have done to them. [00:12:32] There are a lot of other ways we could go about defining what attention is. We could define it by the amount of dialogue a speaker has; we could define it simply as the number of characters who are of a specific gender. Across all these different dimensions, the results end up being the same as the ones I'll describe. What I'm describing here is one specific operationalization, where attention is the set of things that characters do and have done to them.
[00:12:57] To do this we rely on an NLP pipeline that I developed a few years ago for applying all of these individual steps in NLP to book-length documents, one that can get past some of these hard problems of computational complexity as a function of the length of these objects. It includes part-of-speech tagging, named entity recognition, and dependency parsing, to get at the specific attributes of what characters are doing and having done to them; character name clustering, to resolve different mentions of characters to the same unique entity; pronominal coreference resolution, which can link specific mentions of 'he' and 'she' to the characters they're associated with; and also quotation attribution, to link specific mentions of a character to the dialogue they're speaking. All of this is up on GitHub right now, and has been for a while. [00:13:43] The way we'll go about representing a character, again with this operationalization where we define a character as the set of things they do and have done to them, plus the things they possess and the things that are predicated of them, looks like this. We have an entire representation where we know who the unique characters are, through character name clustering and pronominal coreference resolution, and we have a way of representing what they do as a function of their syntactic dependencies. Tom Sawyer here is the agent of 'paints'; he's the patient of being 'kissed'; the 'brush' belongs to him; and he's predicated as being a 'rascal'. So this is the set of unique things used to describe this individual character, and if we take all the individual counts of these specific relations and add them up, that gives us a single number for Tom Sawyer: Tom Sawyer here is reduced to the single number 260.
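The character representation just described can be sketched as a bag of (relation, word) counts per character. The tuples below are invented examples standing in for real output from the dependency-parsing and coreference steps of the pipeline:

```python
# Sketch of the character representation: a character is the bag of things
# they do (agent), have done to them (patient), possess, and are predicated
# to be. The relation tuples are illustrative, not real pipeline output.
from collections import Counter, defaultdict

# (character, relation type, head word) triples, as a parser + coreference
# system might produce them
relations = [
    ("Tom Sawyer", "agent",     "paints"),
    ("Tom Sawyer", "patient",   "kissed"),
    ("Tom Sawyer", "possesses", "brush"),
    ("Tom Sawyer", "predicate", "rascal"),
    ("Becky",      "agent",     "smiles"),
]

characters = defaultdict(Counter)
for character, relation, word in relations:
    characters[character][(relation, word)] += 1

# Reducing each character to a single number: the total count of relations,
# i.e. the "amount of attention" that character receives.
attention = {c: sum(rels.values()) for c, rels in characters.items()}
print(attention)  # {'Tom Sawyer': 4, 'Becky': 1}
```

Summing a character's counts is exactly the reduction to a single number described above (260 for Tom Sawyer in the real data).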
[00:14:37] Ok, so that gets us to what we can use to see whether or not we have a difference in the attention given to male and female characters. Once we have this operationalization of a single number, along with the gender of those characters, we'll see if we can find any significant differences. We apply this to a dataset of about 100,000 narratives from the HathiTrust Digital Library, all published between 1703 and 2009. All of this data comes from university library scans, where books have been scanned in a diversity of libraries and then OCR'd to extract the text, and each one of these books is then subjected to the same pipeline I just described. We also have an additional set of 10,000 narratives from a separate data source, a Chicago corpus, published over a different time period, and I'll describe how that corroborates some of these results. Ok, so here is what we see when we take this very simple measurement. Once we have a way of operationalizing what attention is for a character, what we can do is just count up the amount of attention for characters whose gender is female and normalize that by the total count of attention for all characters. This is the plot we get over 200 years of literary history, and what we end up seeing is that, overall, [00:15:52] men tend to be focused on much more in literary novels than women, and the share given to women also tends to go down over time, only to be partially corrected with the rise of second-wave feminism in the 1970s, by one theory you might have. Now, if we had even representation, we would see a number closer to 50 percent: 50 percent attention to male and female characters. The fact that this is below that line means we have more attention being given to male characters than to female characters.
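That corpus-level measurement can be sketched like this; the rows below are made-up numbers standing in for per-character attention counts aggregated by publication year over the real corpus:

```python
# Sketch of the corpus-level measure: for each publication year, the share
# of total character attention allocated to female characters.
# The (year, gender, count) rows are invented; real input would come from
# the per-character counts produced by the pipeline.
from collections import defaultdict

rows = [
    (1850, "F", 120), (1850, "M", 380),
    (1950, "F", 210), (1950, "M", 390),
]

totals = defaultdict(lambda: {"F": 0, "M": 0})
for year, gender, count in rows:
    totals[year][gender] += count

# Female attention normalized by total attention, per year
share_female = {
    year: g["F"] / (g["F"] + g["M"]) for year, g in sorted(totals.items())
}
print(share_female)  # {1850: 0.24, 1950: 0.35}
```

A value of 0.5 would indicate equal attention to male and female characters; the talk's finding is that the real curve sits well below that line.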
[00:16:19] We might postulate that this has something to do with the specific data source, with just the HathiTrust, but if you plot this same measure of attention for the separate corpus, with a separate time frame and separate selection criteria, we see it overlapping in the same pattern, corroborating the finding to a more or less [00:16:38] even degree in this context. We might ask, though, if there are differences that come about as a function of the authorship of these books, and we have good reason for thinking so, because if we look at how many books were published by men and by women over this period, we see a very similar pattern: at the very beginning of the 19th century there were about as many women authors as men, that tendency went down over time, and it was only somewhat corrected in the 1970s. So there's a very similar pattern here: there used to be many more women authors, and they tended to be overrun by male authors as time went on. [00:17:21] We might have good reason for thinking why this is the case. There's been a lot of work investigating exactly why this phenomenon of women being edged out of publishing happened over this time period. Before the 1840s, Tuchman points out, at least half of all novelists were women, but by 1917 most high-culture novelists were men, and some of the etiologies offered for why this happened include the fact that in the early 19th century fiction was not yet a very high-status career; after the 1840s its status increased, which brought more men into competition, into taking over these roles from women.
[00:18:03] One other possible factor here is that the terms of publishers' contracts ended up becoming more disadvantageous toward women in this time period, which edged women out of this publishing space. And on the more positive side, there were other careers besides novelist opening up, so women didn't have to be restricted to this one profession; there were other opportunities that let them move out of this market, rather than just pressure pushing them out. So we see that the potentially changing distribution of authorship might have some effect on the measure of attention to male and female characters that we see captured in these [00:18:41] novels. And in fact, if we look at this same distribution but separate out the attention given to male and female characters as a function of the gender of the author, we find something very striking. Looking over the same time period, women as authors end up allocating roughly equal attention to male and female characters, much closer to 50 percent, while men as authors end up allocating three times more attention to male characters than to female characters. It's a finding that may be very obvious in retrospect, but it's something that is really only driven by the fact that we have empirical data to give us this signal and tease apart this distinction. Once we have some measurement of what a character is, some way of measuring what attention to a character looks like, and then factorize this space as a function of the gender of the author, we see this really radical disparity, which we wouldn't have seen before. So this process has told us something fundamentally new about literary history: we know something now that we didn't know before. The takeaway from this case study, first of all, is that we can use these methods of NLP to generate new insight about literature; it's
possible to do this using the kinds of computational methods you're developing here. But at the same time, we know there are a lot of research challenges in NLP that we need to overcome in order to drive more complex analysis. [00:20:04] The kind of more complex analysis I'm thinking about here can really be articulated with this graph from xkcd. You're all mostly grad students and professors, so I'm sure you know what xkcd is, but this here is an example of plot structure being visualized for The Lord of the Rings in the context of this comic. If you ask what this plot structure is, we can see that, whatever plot is, plot is a very complex, nebulous abstraction. We don't really know what plot is, but what we do know is that it has to involve a certain set of individual atomic components. We know that plot has to involve people, and in this case we see plot lines being defined by people who are doing the same thing together at the same point in time. This is the plot line of Merry, Pippin, Sam, Frodo, and Bilbo, who are all together at Bilbo's party with Gandalf, end up together with Aragorn at Weathertop, and end up at the Council of Elrond with all these other people. The plot lines here are defined by people who are doing things together in the same place at the same time. This is a level of complexity that we would like to get to, to help drive our understanding of literature, that we can't quite get at with the state of the art in NLP right now. [00:21:24] What my group is doing is trying to take this really complex abstraction and decompose it into the individual structural elements that we know need to be a necessary part of it. Each of these individual parts also has an existing NLP
application that could help drive this kind of work. We know that plot has to involve characters, and characters involve the fundamental task of entity recognition. We know that plot has to involve events, characters doing things, and this corresponds to the task of event detection. [00:21:57] We know plot has to involve setting, the places where these events take place, and that involves entity recognition again, and possibly also setting coreference: even if a place isn't named, we know it's the same place where other events happened earlier on. In many cases plot also needs to involve objects; in Russian folk tales, magical swords and potions are really important for advancing plot, and that matters for the tasks of object detection and coreference resolution. And we know that plot also needs to involve time, which corresponds to the fundamental NLP problems of temporal processing and event ordering. Now, the issue with all of these different existing steps in an NLP pipeline is that when you look at the state of the art for a lot of these individual methods right now, you can see the state of the art is actually getting quite high. We can say that the state of the art for a task like tokenization is essentially 100 percent in English, by construction, because we define what a token is and write regular expressions that can identify tokens. [00:23:02] Tagging now is up to about 98 percent accuracy. NER is at about 93 F-score. Constituency parsing is up to about 95, which is quite amazing. And coreference is the straggler in this bunch, but it's still not terribly unrespectable, at an F-score of 73. Now, the problem with all of [00:23:23] these states of the art, however, is that they're essentially defined on a very small domain of text, and I don't mean just news like CNN
and the Wall Street Journal; I mean specifically the 1989 Wall Street Journal. This is one of the dirty secrets of NLP: a lot of these benchmarks have really been optimized toward one very narrow domain, because of the rise of the Penn Treebank back in the early nineties, and a lot of other layers of annotation have been placed on top of it to get us OntoNotes and other resources. So a lot of our focus is still on 'contemporary' news from 30 years ago. If you look at this measure of attention, we see the Wall Street Journal getting probably the most, but there is a long tail of other domains that have been getting more attention recently, including Twitter and product reviews, and it's down this long tail where we start getting other languages like French and Mandarin. But there really exists a very long tail of domains, genres, and even individual authors, like Mark Twain, that we would like this entire NLP pipeline to work better on, because one thing we know from looking at a lot of these different tasks in NLP is that if you take a model trained on one domain and apply it out of the box to another, you see an incredible drop in performance. This is true for English part-of-speech tagging: a tagger trained on the Wall Street Journal, in the work I'm showing here, gets an accuracy of 97 percent on the Wall Street Journal; take that same model and use it to tag Shakespeare, and accuracy drops to 80 percent. Take a model trained on modern German newspapers, where the accuracy of part-of-speech tagging is about 87 percent, and use that same model on Early Modern German, and it drops to 70 percent.
[00:25:06] Take a model trained on the Wall Street Journal and apply it to Middle English, and the accuracy falls by 45 points. Now, there's no reason why this should work for Middle English, Middle English is nothing like contemporary English, but we can still see that this drop in accuracy is really calamitous. The same is true for Italian part-of-speech tagging moving from news to Dante; it's true for English moving from the Wall Street Journal to Twitter; it's true for English NER moving from news to Twitter; it's true for phrase-structure parsing moving from the Wall Street Journal to medical text, from the Wall Street Journal to patent data, and from the Wall Street Journal to magazine text. Across all of these different pairs of domains, we see a drop in performance of about 20 absolute points, effectively rendering a lot of these tools unusable. This is the fundamental problem we have. Now, we have a lot of strategies we could use to address this problem. That includes work in domain adaptation, trying to take a model that's been trained on one domain and adapt it slightly so that it works on another, given either a small amount of data in the new domain or no in-domain data at all.
[00:26:10] The use of contextualized word representations has really mitigated some of these effects: if you have a model that adapts its representation of a word to its specific context of use, it can capture a lot of nuance you wouldn't have had otherwise. But if you really care about having NLP work better for a specific domain, and that domain happens to be literature, the most straightforward thing you can do is just create more data in that domain. So that's what I'll talk about here. We've done this in this context for entities and for events, and we also have coreference and quotation attribution in progress. The dataset for entities and events is up on GitHub right now, and I'll give the link at the end of the talk; the rest of this work will probably be published by the end of this year, so all of it will be available for you to use yourself. Ok, so let's start first by talking about entities, and literary entities in particular. Now, we used entities in this character work: [00:27:06] an entity here is the fundamental atomic unit we're trying to take measurements about. We defined a character as being all of its coreferent mentions, where each one of those mentions needs to be recognized on its own. In the context of Tom Sawyer, the literary entities here as people would be 'Tom', 'that boy',
[00:27:26] 'the old lady', and even 'the room', as a location or facility. Now, a lot of work in NLP focuses on the specific subtask of named entity recognition, where you're looking for mentions of specific categories, like people, places, and organizations, that are explicitly named. If you take this passage from Jane Austen's Emma, these would be the two names pulled out of the passage, 'Mr. Knightley' and 'Isabella', and nothing else. We know, however, that there are a lot of other entities in these kinds of texts that are not named but are still going to be very important to reason about. Even in this one passage we have the entities 'Mr. Knightley', 'a sensible man', 'about seven or eight-and-thirty' (that is, 37 or 38, for those of us who speak modern English), 'a very old intimate friend of the family', 'the family', 'Isabella', 'Isabella's husband', and 'the elder brother of Isabella's husband'. We want to be able to extract all of these common entities as well, in addition to the named entities, and we want to do this for lots of reasons, like establishing coreference: all the things grouped in this box refer to the same fundamental entity, while 'the family', 'Isabella', and 'Isabella's husband' refer to separate entities. [00:28:41] Now, when we formulate the task as trying to recognize spans of text that correspond to arbitrary entities, we can't rely on the same assumptions that named entity recognition does in terms of having flat, non-overlapping structure, because even in this context of 'the elder brother of Isabella's husband' we have nested entities that we need to be able to recognize. So we're in the different application area of nested entity recognition, which entails slightly different methods than flat NER. What we've done here is annotate a new dataset of about 200,000 words from 100 different books taken from Project Gutenberg, which come from a mix of literary styles, including a high literary style
[00:29:24] like Wharton's The Age of Innocence and Joyce's Ulysses, and popular pulp texts too, including King Solomon's Mines and Horatio Alger's Ragged Dick. King Solomon's Mines is essentially the Indiana Jones story, the one Indiana Jones was based on, and Alger was one of the most popular authors of the early 20th century; all of his books, and he wrote like a hundred of them, are essentially the same, they're all rags-to-riches stories about kids who pick themselves up by their bootstraps and learn to make it in the world. So we have a really broad range of styles being captured here, to get at the different ways these entities can be expressed. We take the first 2,000 words from each of these texts and then annotate them for all the entities they contain, and the entity classes we're using are the same ones originally used in the ACE 2005 specification for NER. They include people, that is, single persons with proper names or common entities, or sets of people; organizations, formal associations; vehicles, like ships and [00:30:28] wagons and carts and things; and locations, which is where things get interesting from this perspective of entities, because we have a distinction here between geopolitical entities, entities that contain a population, a government, a physical location with political boundaries; locations, natural locations that have physicality but don't have political status; and facilities, which include man-made [00:30:54] structures built by people that have a level of granularity greater than a single room in a house. So a kitchen would be an example of a facility, a house is a facility, and a street is also an example of a facility, as something built by people.
[00:31:10] And there are a couple of places where the choices we make about how to annotate these entities become really complicated by the fact that we're annotating literary texts in particular. One of these comes in the context of metaphor, because in a lot of the standard news datasets where you see these kinds of nested entities annotated, it's very straightforward to annotate sentences with this kind of copula structure. "John is a doctor" is very straightforward to tag, because John is a person and a doctor is a person; the things predicated of a given entity tend to be of the same class as that entity. In literature we have complications like this: "the young man was not really a poet; but surely he was a poem." We don't want to say that a poem becomes a person just because it's asserted of this individual. In our case we made the decision to only annotate phrases whose types denote that entity class, so here we would say "a poem" is not an example of a person. We also have complications from personification. This example comes from Sewell's Black Beauty: "as soon as I was old enough to eat grass my mother used to go out to work in the daytime, and come back in the evening." This is narrated by Black Beauty, so what's the problem here: who is "my mother"?
[00:32:39] It's a horse, yes. So in this case we make the assumption that a person, under our definition, is any character with either internal monologue or some other evidence of cognitive capacity: characters who engage in dialogue, or for whom we have some reported internal monologue, regardless of their human status. This importantly includes aliens and robots in science fiction as well; if they engage in dialogue, we treat them as people, because we care about them as characters. Ok, so in the end we have about 13,000 entities annotated according to these standards in 200,000 words, and by far the most frequent of these are people. If we look at the distribution of these entities with respect to existing news datasets, we see a very stark difference: [00:33:34] there are many more mentions of people in literature, as we can see in this person category, compared to the ACE annotations for news; there are many more mentions of geopolitical entities in news compared to literature; and literature has many more mentions of facilities, like houses and rooms in houses, than news does. And given this distribution we can ask how well we can find these entity mentions in text as a function of the training domain: if we hadn't gone through this process of annotating all these texts, how well could we have taken a model trained on news and used it to recognize these entities in our dataset? To test this, we take a 2018 layered bidirectional [00:34:19] LSTM model that's state of the art on ACE 2005 and evaluate the performance difference when we alter the training and test domains. If we take this layered bidirectional model, train it on ACE news, and evaluate it on ACE, we end up with an F-score of 68.0, which is more or less state of the art for this problem. If we take the same model and evaluate it on our new literary data, we see a drop in performance of about 25 points, effectively rendering these tools more or less useless for this task: the F-score drops down to 45.7. If, however, we take a model that's been trained on the training portion of our data and evaluated on the test portion, so that the domains are the same, we see an increase in performance back up to the level of the original ACE [00:35:12] evaluation. And in fact we can go one step further and swap out the static word embeddings for BERT contextual embeddings, and we see an increase in performance of about 9 points, just by making that one change: taking the static word embeddings and swapping in contextual ones leads to a dramatic increase in performance. Now, given this kind of model, we can also ask what it's learning that distinguishes its behavior from models trained on other domains. To do this we tag entities in 1,000 Project Gutenberg texts using these two models, one trained on news and one trained on literature, and analyze the difference in the frequencies with which a given string is tagged as a person under each model. When we look at which strings are tagged more often under the model trained on literature compared to news, what we find is that the string tagged most often is "Mrs."; the second most common is "Miss"; the third most common is "Lady".
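The diagnostic just described can be sketched as a simple frequency comparison: count how often each string is tagged as a person by each model over the same corpus, then rank strings by the difference. A minimal illustration (the helper name and the toy counts are mine, not the talk's actual tagger output):

```python
from collections import Counter


def person_tag_diff(tags_lit, tags_news):
    """Rank strings by how much more often the literature-trained model
    tagged them as PER, compared to the news-trained model."""
    lit, news = Counter(tags_lit), Counter(tags_news)
    # Missing keys in a Counter return 0, so the subtraction is safe.
    return sorted(set(lit) | set(news), key=lambda s: news[s] - lit[s])


# Toy tag lists (illustrative): each entry is one string tagged as PER.
lit = ["Mrs. Weston"] * 3 + ["Miss Bates"] * 2 + ["John"]
news = ["John"] * 2
print(person_tag_diff(lit, news)[:2])  # strings the literature model favors
```

Ranking by the raw difference surfaces exactly the gendered honorifics ("Mrs.", "Miss", "Lady") that the news-trained model under-tags, which motivates the recall-by-gender experiment that follows.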
[00:36:15] Now we can see why this might happen: when we look at news, there just tend to be many more mentions of men in the particular dataset that ACE is using to train entity recognition models, while literature has many more mentions of female characters. To make this concrete as a testable hypothesis, what we wanted to see was how well each model identifies entities who are men and women: is there really a disparity in performance when we look at the accuracy of these models with respect to recognizing men and women in these datasets? To do this, we annotated the gender of all person entities in the test data we've been evaluating on and measured the recall of each model with respect to those entities. What we find is that the model trained on ACE ends up with a very stark disparity in whether it can recognize men versus women: the recall of the news-trained model for men is about 50 percent, and for women it's about 38 percent. So overall we know this performance is not good; it generally cannot recognize these entities well. But even given that, it recognizes women even worse than it does men, by more than 10 points. If we take the model trained on literature, we see much better performance overall, as we saw before, but importantly we also don't see a disparate impact in recognizing men versus women: they're more or less equivalent in terms of our ability to recognize both. So not only can this kind of data help us [00:37:50] measure these kinds of entity spans in this new domain, it also gives better performance in terms of the lack of disparity between these two classes. Ok, so let's talk about events now, as the other form of plot structure that we need to get at. [Audience question.] [00:38:10] Yeah, we haven't tried that yet; we also haven't tried using the BERT contextual embeddings on ACE to see how well that does on literature. I suspect, though, the drop would still be close to 25 points. Ok, let's talk about events, because there's been a ton of work in NLP on events overall. There's a ton of work thinking about events as they show up in news: event trigger detection, slot filling, measuring whether a given author has a specific belief in the factuality of an event. There's lots of work on grounding events in time, on constructing narrative chains, on inferring event schemas in an unsupervised sense, and also work here in these labs on using events as a way of conditioning text generation: start from some event and generate text that corresponds to the kind of structure we want to capture. [00:39:09] There's been a lot of theoretical work on this too, conceptualizing what the metaphysics of events looks like, going back to Russell, Whitehead, and Quine, all the way up to the more recent work by Davidson and Dowty. But at a very simplified level, what an event really captures across a lot of these formulations is that an event is a thing that happens. If we adopt this realist view, we care about events that actually transpire in the world: an event is a thing that happens in reality.
[00:39:45] Events in language are typically realized through verbs, so you often see them in contexts like "he walked down the street," but there are lots of other ways an event can be realized as well. An event can also be a nominalization, like "he had a nice walk," that fundamentally corresponds to some real action that transpired in the world. [00:40:04] Now in our context, we care about those events that are depicted as actually happening in the course of a work of literature. There is no real world here that we can ground any of these events to; it's all a depicted, imagined space. But we want a way of thinking about the distinction between events that are depicted as happening and those that are depicted as not happening, and thankfully we have a lot of formal theoretical structure for making this distinction. One is to facet these events by their polarity: polarity specifies whether an event is asserted as happening or asserted as definitely not happening. In "John walked by Frank and didn't say hello," the walking is depicted as happening, but the saying is depicted as definitely not happening. So it's not just that we have a verb; we have two verbs that correspond to events, and we care about just the ones depicted as actually transpiring: the walking here, but not the saying. [00:41:06] Positive polarity is depicted as taking place; negative is depicted as not taking place. Tense also gives us some information: in "I walked to the store and will buy some groceries," the walking happened but the buying has not happened yet. It may happen at some point in the future, but at the moment of articulation of this sentence the buying has not happened; only the walking has.
[00:41:28] Specificity is another dimension we can use for faceting events in this more complex way. Specificity captures whether a given event corresponds to an action that takes place with respect to a specific individual at a specific place and time, or whether it describes an action that [00:41:50] involves an entire class of individuals. In my case, "my son just watched Frozen" is an example of a specific event, because it involves my son, who is a specific person. But if I say "kids like Frozen," that isn't an event that has actually transpired in the world; it's a predication we make about a general class of individuals, not something that happens to any one specific individual. So "specific" here means a singular occurrence at a particular place and time, and "generic" is a claim about groups or abstractions. The more complex facet here is the epistemic modality of verbs. Epistemic modality gives us a way of deciding whether an event is asserted as actually happening in the epistemic reality of the pragmatic context. If I say "I walked to the store to buy some groceries," again the walking happened but the buying has not: this signals my intent, but it doesn't specify whether I successfully bought those groceries. There may be evidence later (the next sentence may say "I did buy those groceries," and at that point we can say the buying happened), but at the moment of articulation the buying is not asserted as actually happening yet.
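The facets just introduced can be pictured as per-token attributes, with only one combination counting as a realis event. A minimal sketch, assuming a simplified inventory of facet values (the class and the `is_realis` rule are my illustration of the scheme, not the project's annotation code):

```python
from dataclasses import dataclass


@dataclass
class EventFacets:
    polarity: str     # "positive" or "negative"
    tense: str        # "past", "present", or "future"
    specificity: str  # "specific" or "generic"
    modality: str     # "actual", "hypothetical", "desired", ...

    def is_realis(self) -> bool:
        # Only events asserted as actually happening count as realis.
        return (self.polarity == "positive" and self.tense != "future"
                and self.specificity == "specific"
                and self.modality == "actual")


# "I walked to the store to buy some groceries"
walked = EventFacets("positive", "past", "specific", "actual")
buy = EventFacets("positive", "past", "specific", "hypothetical")  # intent only
print(walked.is_realis(), buy.is_realis())  # True False
```

Treating realis as the conjunction of all four facets captures why "walked" counts but "buy" does not, even though both are verbs in the same sentence.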
[00:43:03] Now there are lots of different ways in which we can express this kind of epistemic modality. One is beliefs: in "rumors of my demise," the demise is not asserted as actually having happened; the person has not died, it's only a rumor. Hypotheticals: "if you visit Berkeley, try Cheeseboard"; the visiting has not taken place. "John was ordered to return the book or face a fine": the ordering happened, but the returning is not asserted as actually happening. "You might threaten to release the photos": the threatening took place, but the releasing did not. Desire: "she wants to go to Rome"; the wanting takes place, but the going is not asserted as actually taking place. So we have a couple of different dimensions at this point along which we can annotate individual tokens in this same dataset, for whether or not they correspond to this category of realis events: events depicted as actually happening in the literary space of the novel. Once we have this kind of data, we can carry out the same kind of predictive task: for a given sentence, can we predict whether every individual token corresponds to a realis event? In this passage from David Copperfield, "my father's eyes had closed upon the light of this world six months, when mine opened on it," we want to predict that the closing and the opening are both events that actually happened, while in a sentence like "Call me Ishmael" we want to predict that the calling did not happen, because it's an imperative; it's not something expressed as really happening in the context of the novel. So what we've done here is develop a couple of different methods for predicting whether every event has actually transpired, on this dataset that we've annotated. The simplest baseline is a very simple one, but also
one that many people who aren't steeped in NLP would immediately jump to: say that every verb we see in a book is an event. [00:45:03] We can measure how well this performs against the test set, and it ends up getting quite high recall, but not perfect recall, because 23 percent of all tokens that are events are not verbs; they're nominalizations and other forms of expressing this kind of action. But the F-score here is also quite low, 29, because precision is poor. If we take a featurized model and add some learning on top, we can featurize every individual token according to the word identity, the part of speech, the context around that word in terms of unigrams and a bag of word n-grams, information about the word's WordNet synset and its place in the WordNet hierarchy, dependency information (including the dependency label and the head of that word), and information about whether the noun phrase it belongs to is a bare plural, which in principle gives us some information useful for deciding between generic and specific mentions. If we run this model through a simple L2-regularized logistic regression, it ends up performing much better than the simple baseline, up to about 57 points of F-score, but it's still a very simple model with a very simple featurization. What we can then do is use more complex machinery: a bidirectional LSTM with a CNN on top of it, plus a subword CNN to give us representations of words we don't have embeddings for, word embeddings learned from a domain-specific collection of 15,000 English Project Gutenberg texts, and also a sentence-level CNN on top of this.
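The verb baseline and the evaluation just described can be sketched in a few lines: tag every verb as an event, then score the token-level predictions with precision, recall, and F1. The example sentence, Penn-style tags, and gold labels below are illustrative:

```python
def verb_baseline(pos_tags):
    """The naive baseline: every verb (Penn tags starting with VB) is an event."""
    return [pos.startswith("VB") for pos in pos_tags]


def precision_recall_f1(pred, gold):
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


# "John walked by Frank and did n't say hello"
pos  = ["NNP", "VBD", "IN", "NNP", "CC", "VBD", "RB", "VB", "UH"]
gold = [False, True, False, False, False, False, False, False, False]
pred = verb_baseline(pos)
print(precision_recall_f1(pred, gold))
```

On this toy sentence the baseline recovers the one realis event ("walked", perfect recall) but also fires on the auxiliary "did" and the negated "say", so precision collapses, which is exactly the high-recall, low-F behavior the talk reports.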
[00:46:43] What the structure looks like is this: we start out with the individual word at the bottom, with a word embedding plus the representation from the subword character CNN, put that into a bidirectional LSTM, and output from that both a direct representation feeding the prediction and a CNN scoped over all of those outputs for the entire sentence. If we evaluate this representation, we see that it does about 8 points better than the featurized model. Having this contextual information that comes from the word's use in context in a bidirectional LSTM, along with the word embeddings we get from Gutenberg, leads to [00:47:24] much better performance than the featurized model. Again, we can then substitute the static word embeddings learned from Project Gutenberg text, swapping them out for BERT contextual embeddings, and see how well the model performs. What we find is the same dramatic improvement when we swap in these BERT embeddings, not even fine-tuned to the domain or the task, just swapped in for the static [00:47:50] embeddings: we see an increase in performance of 9 points. So one takeaway from all of this work is that BERT is pretty amazing for a lot of these sentence-level prediction tasks, increasing performance across these domains by about 9 points on average.
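Why does swapping static embeddings for contextual ones help so much here? A static table assigns a word one vector everywhere, while a contextual encoder produces a vector that depends on the surrounding sentence, so cues like "will" (future, irrealis) can shift the representation of the verb that follows. The toy encoder below is purely illustrative (BERT's actual representations come from a deep transformer, not this mixing rule):

```python
# Toy contrast between a static embedding table and a "contextual" encoder.
static = {"walk": [0.1, 0.9], "will": [0.5, 0.5]}


def static_embed(tokens):
    # Same vector for a word no matter where it appears.
    return [static.get(t, [0.0, 0.0]) for t in tokens]


def contextual_embed(tokens):
    # Stand-in for a contextual model: each vector also mixes in the
    # previous token, so "walk" after "will" differs from "walk" alone.
    out = []
    for i, t in enumerate(tokens):
        base = static.get(t, [0.0, 0.0])
        prev = static.get(tokens[i - 1], [0.0, 0.0]) if i else [0.0, 0.0]
        out.append([b + 0.1 * p for b, p in zip(base, prev)])
    return out


s1 = contextual_embed(["will", "walk"])
s2 = contextual_embed(["walk"])
print(s1[-1] != s2[-1])  # same word, different representation in context
```

For realis event detection this matters directly: whether "walk" is a realis event depends on modal and tense cues elsewhere in the sentence, which a static embedding cannot see.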
[00:48:06] Ok, so what do we want to do with these events as a representation of the data? One thing we can do, the simplest descriptive case, is just measure the abstractness of a novel by the density of these realis events: which novels have more things happening? In our particular dataset, we see that it tends to be the more lowbrow books that have more things happening, like The Quest of the Silver Fleece or The Invisible Man, sort of action-adventure-type novels, while books like The Magnificent Ambersons in our sample had absolutely no realis events: they had a more high-level, narratological description of the state of the world, and no actions by specific characters in that world, at least in these 2,000 words. We can also measure the distribution of events over narrative time. In our particular case, there tend to be many more entities mentioned at the very beginning of a book and many more actions happening toward the end; this here is a distribution of where these events occur in a novel in terms of narrative time, and they tend to show up much more toward the end. Finally, we can also ask what the relationship is between these events and external signals like prestige. [Audience question.] [00:49:25] No, these are 10 percent of all tokens in the first 2,000 words. This one is a function of the first 2,000 words; this one is a function of the full text. Ok, so the last thing we can do is ask whether there's a relationship between these events and other external measures of these books. One thing we can look at is prestige. Prestige is something that's been characterized in a lot of different ways in a computational sense. [00:49:58] Mark Algee-Hewitt and colleagues at Stanford have defined prestige by inclusion on Stanford exam lists: if you're a PhD
student in English at Stanford and a book shows up on a list of books you need to read, it has higher prestige. Underwood makes a different claim: prestige is defined by being reviewed by an elite literary journal, so if you have a set of journals that you know are elite, everything they review is marked as prestigious.
[00:50:28] A book that hasn't won any awards or attracted such reviews would be an example of low prestige, while Milkman by Anna Burns is one we can think of as having high prestige: it won the Man Booker Prize in 2018. So looking at signals like these, we can cleanly separate these books into high-prestige and low-prestige classes. In our case, we're going to use the metric Ted Underwood defined: we have a set of authors identified in that work as having the highest and lowest prestige, as measured by the number of times their works were reviewed by these elite literary journals. This gives us 150 high-prestige novels and 188 low-prestige novels, and if we count up how many events happen in each of these categories, we end up seeing that high-prestige novels have a much lower ratio of realis events; they have a much lower density of these actual depictions. On average, high-prestige novels have events showing up at a ratio of 4.3 percent, while low-prestige ones have them at about 5 percent, and the 95 percent confidence intervals here show that this difference is significant. But beyond the average ratios, we can also look at the distribution of these ratios over all the books we're evaluating, and what we find is that high-prestige novels have a lot of flexibility in
how many events they depict: this ranges anywhere from 2 percent up to 6 percent, from a very low frequency of events to a very high one. [00:52:27] Low-prestige novels don't have this degree of flexibility: they never get below about three and a half percent in their density of events, and tend to be much more bunched up together around the 4 to 5 percent range. One way of interpreting this is that low-prestige novels tend to have less variability in their event density because they're under the pressure of something always needing to happen, something always needing to be on the page to capture the reader's attention, in a way that high-prestige novels just aren't. Ok, so this gives you at least a sense of what we can do with this kind of methodology: we want to apply it to literature to help drive some insight into literary theory and literary history. All of the data for the events and the entities is up on GitHub right now, and as I mentioned, we have a couple of other layers of annotation that we'll probably be putting out at the end of this year. With that, I'm happy to take any questions in the time we have left. Thanks. [00:53:17] [Inaudible audience question.]
It was able to identify women almost as well as men, yes. So we see there's less representation, but we're still able to identify them reliably. I think that's because (you can think about this in terms of the bias coefficient in a logistic regression, compared to the feature values of any individual feature) the base rate is important for some things, like characterizing how much attention a class of characters gets, but it's not important for discriminating between the classes when there's enough information in the textual signal of the sentence to let us capture them at equal rates in the tagging task. [Audience question.] [00:54:34] Yes, so the criterion we had for this particular collection was that all the books had to be in the public domain, effectively meaning published before 1923, so that we can release the data to the public. Within that, we gave our annotators a lot of flexibility in choosing which books they wanted to annotate, so it's not that we had strong guidelines requiring a lot of diversity, but it ended up coming out that way as a function of the choices the individuals made. I'd also point out that I have a different annotation project going on now that moves all of this work to contemporary texts, so I'll end up publishing some work that includes books published after 1924. [Audience question.] [00:55:21] Yeah, I think this is a problem you see with any kind of task that involves humans as annotators. One of the things I stress in my own NLP class is that for a lot of the tasks we work on in NLP, like POS tagging and parsing, we're given datasets which define a gold standard, but it's not like those gold standards come to us ex nihilo; they don't just come to us given by some
[00:55:54] infinitely knowledgeable source. Those datasets are constructed by human judgments, and so we are the ones, as people, who construct truth. And because we're the ones constructing truth, that means there's potentially a lot of ambiguity about what individuals think truth is. The fact that different people disagree about what the correct label is means there's always going to be some upper bound on performance that's not 100 percent; it's going to be a function of the rate at which people agree on the value of a label for the same data point. [Inaudible audience question.] [00:56:33] So in this case, you mean rebalancing the training data, so that you have exactly the same things being fed into training, just multiple copies of them? [00:57:15] You know, I would say that kind of method works in some cases, when you have really strong class imbalance in the things you're predicting. But in our case, in all these cases where you have underrepresentation, I think what you want is better diversity in the representation of those individual examples, which you wouldn't get by just copying the same examples twice. [Inaudible audience question.] [00:58:04]
Yeah, so the standard way of doing this for the NER task is to say you can calculate the test metrics at either the macro level or the micro level. In this case we could report the scores for the individual classes and say how well we did on recognizing people, or facilities, or locations; or we can say, for a given span with a given label in the gold data, did you predict the span right and did you get the exact category right? It only counts as correct if you get both exactly right; otherwise it counts against you in the denominators of precision and recall. So a correct prediction is a correct span with the correct label. [Audience question.] [00:59:16] Yeah, for sure. We have looked at that, in the same paper where we describe this difference in the amount of attention given to male and female characters as a function of gender; we also characterized what was different about them, what they were doing that was different, and how that ends up changing over time. One of the things we observed was that [00:59:37] the physical descriptions of male and female characters end up being less discriminating: if we train a model to take the actions a character performed and predict whether that character was a man or a woman, as time went on that became a much harder task, even holding the dataset size and the feature representation space constant.
[00:59:58] As time went on, men ended up doing more things that were like women, and women ended up doing more things that were like men. [Audience question about character interactions.] Yes, so actually we do have some work going on right now trying to measure information propagation in social networks in novels, to see how a given piece of information gets passed from one person to another and then to another, and in those contexts we're also looking at the gender dynamics of how information spreads. It doesn't quite get at the Bechdel test question of whether that criterion is satisfied in a book, but it does give us some measure of the kinds of dynamics that happen in a group setting. [01:01:00] Ok, thank you, thank you.