[MUSIC PLAYING] AMANDA: Vivi, how's your summer been? VIVI: Good. AMANDA: What have you been doing? VIVI: I have been reading a lot. AMANDA: How much do you read a day? VIVI: I read about two hours each day. AMANDA: What books have you been reading? VIVI: The Twilight Saga. It's really fun, especially at night, because sometimes when I go to bed, I'll dream that I'm one of the characters. AMANDA: What if I told you there was a program that could read all the Twilight books in a couple of seconds? VIVI: That would be pretty cool. [MUSIC PLAYING] CHARLIE BENNETT: You are listening to WREK Atlanta. And this is Lost in the Stacks, the Research Library Rock And Roll Radio Show. I'm Charlie Bennett in the virtual studio with Fred Rascoe, Ameet Doshi, Amanda Pellerin, and the pronunciation of some strange word. Each week on Lost in the Stacks, we pick a theme and then use it to create a mix of music and library talk. Whichever you are here for, we hope you dig it. AMEET DOSHI: That's right, Charlie. Today's show is called The Distant Reader, or subtitle, How to Read Lots and Lots and Lots of Books. CHARLIE BENNETT: You have to love a subtitle like that. FRED RASCOE: Yes. It's another edition of our very occasional series of shows on how to read a book, or in this case how to read hundreds of books at the push of a button. AMANDA PELLERIN: We're going to be talking about the online text analysis tool called the Distant Reader. Fred and I interviewed its original creator, Eric Lease Morgan, about how this Distant Reader tool helps scholars discover new ways to engage with vast libraries of digital text. AMEET DOSHI: And our songs today are all about things happening at large scale. So answering big questions, finding meaning in mass amounts of information, and the discoveries you might make when you have the right tools. The Distant Reader is all about helping the researcher find language patterns that close reading might miss. It's zooming out to the bigger picture. 
So let's start with Bigger Picture by the Murlocs, right here on Lost in the Stacks. [MURLOCS, "BIGGER PICTURE"] This is Lost in the Stacks. And joining us online is Eric Lease Morgan. He is Digital Initiatives Librarian at the University of Notre Dame. He writes software in libraries, and he's the original creator of the software tool which is our show topic today, the Distant Reader. Eric, welcome to the show. ERIC LEASE MORGAN: Thank you. I'm very glad to be here. FRED RASCOE: So on our show, Lost in the Stacks, we've done a few shows about reading and the different ways that you can read. On your tool, the Distant Reader, it says-- on the website, it says, "it's a tool for reading, for using and understanding large amounts of text quickly and easily." And that kind of reading and comprehension and analysis doesn't really match with the word distant, when I think about it. So how does that work as a distant reader? What does distant mean in the context of the Distant Reader? ERIC LEASE MORGAN: Oh, that's a good question. The Distant Reader is a tool for reading. And the idea is that there's different types of reading. You read a soup can differently than you read a newspaper, differently than you read a scholarly journal article, differently than you read a novel, differently than you read, say, a billboard. They all have words on them. And you read them all. And then there's different volumes of information. A soup can only has a certain amount of data in it. A journal article has a different amount of information in it. A novel has a different amount of information in it. But then, oftentimes, students, scholars, and researchers identify a corpus of content that they want to read. A student might have all the readings for History 101. And the PhD student might have all the things in their bibliography. And the scientist might have a literature review. And the humanist might have all the works of any particular genre. 
Well, at that particular scale, it's difficult to read it the same way that you read a soup can or a journal article. And the Distant Reader is a tool that enables you, or enhances your ability to consume, read, understand, analyze large corpora of stuff. Close reading is usually associated with analyzing and reanalyzing small parts of a text, a particular paragraph, how a particular word is used, how a particular phrase has changed over time in, say, a book. But if I have the complete works of Charles Dickens, it's very difficult to do that, because Charles Dickens wrote-- he got paid by the word, for heaven's sakes. He wrote a lot of stuff. And consequently, the Distant Reader is a tool that helps you read large amounts of content. FRED RASCOE: So I like that soup can compared to a large novel or something like-- or scholarly journal, something like that. I like that imagery. Because I understand, you're looking at the soup can. It's designed so that I can consume a lot of information quickly. Is the Distant Reader kind of creating a soup can-- I don't know what the word is-- visualization or product for quickly consuming something like a novel or a scholarly journal? ERIC LEASE MORGAN: The Distant Reader is designed to read things, multiple things. That's the first thing that you understand. Not necessarily an article or a book. You'll get more comprehension reading an article in the traditional way than if you use the Reader against it. It's about volume. The Distant Reader, on one hand, is akin to a back of a book index on steroids. The Distant Reader extracts characteristics of a large corpus and then provides a way to navigate through that large corpus, that large set of characteristics. For example, it will count and tabulate each and every word in each and every document that you give it. So think about 100 books. The Distant Reader can tell you where each word appears in the entire corpus. It can also tell you what kind of word it is. Is it a noun? 
Is it a verb? Is it an adjective? Is it an adverb? Is it a punctuation mark? Well, once you have this sort of information, then I can answer questions, like what is discussed in this topic, in this book? And one of the answers revolves around nouns. Nouns answer "what" questions. What do they do? Well, those are the verbs. Do they talk, do they say, do they travel, do they measure, do they use, do they consume, do they explain? Novels have a lot of "said." Novels say a lot. [LAUGHS] You can calculate, count, and tabulate the adjectives. And adjectives describe things. And adjectives oftentimes have connotations. They oftentimes have positive connotations or negative connotations. And then you can begin to calculate, is this book overall positive or negative? The Distant Reader takes a pile of content and extracts features from the pile of content, and then allows you to delve and ask questions of those features, and then get the-- and then you have to answer the questions back. It's kind of like a glorified back of the book index. You might have a title index, a name index. But now I have a noun index, and a verb index, and a punctuation index, and a grammar index. [LAUGHS] AMANDA PELLERIN: It's very interesting to me. I happen to be a pretty slow reader. But I think I'm one of these close readers, because my comprehension tends to be very high, but it takes me a while to get through things. And I'm wondering what inspired you to create the Distant Reader tool? And what benefits have you seen from your use of it? And how do you hope that other researchers or scholars or even the general public might benefit from using it? ERIC LEASE MORGAN: That's a question that I've been asked a few times. How did I begin this? So it got started decades ago because I was just simply curious about big ideas. But my skills at computing have evolved and such in the past year and a half that enabled me to actually implement the ideas. And you talked about slow reading and fast reading. 
This is not really about speed, but it is about scale. You can quote unquote "read" a lot of content in this way that you wouldn't be able to read before. And the tool highlights ideas that you may or may not have articulated previously, and then you go, yes, beauty is about nature. At least that's what the reading would say. And then you would go investigate and read closely about beauty and nature to try to understand what they were saying. It's kind of fun, for example, to use a concordance, and simply create a collection of stuff that's all about beauty. And I have now 150 books about beauty, and I can stuff the result into a concordance and search for beauty is. And you get a great, big, long list of sentences that describe beauty. And if you had a traditional regular old book, you wouldn't be able to do that. And this tool enables me to ask these sorts of questions and begin to find answers. And I want to emphasize here something that's very important. This is not a replacement for traditional reading. If this was a visual thing, I would stand up on my chair and jump up and down. This is not a replacement for traditional, nor close reading. It's a supplement. It's an alternative way. It's not right or wrong. It's just an additional way of consuming a text, and it works better with large volumes. AMANDA PELLERIN: We'll be back with more from Eric Lease Morgan about the Distant Reader tool after a music set. AMEET DOSHI: File this set under BH201.C3. [CORIKY, "SAY YES"] (SINGING) My family, my masterpiece unsigned. You just heard "Came So Far For Beauty" by Leonard Cohen. And we started that with "Say Yes" by Coriky. I don't know how to pronounce that band name yet. Those were songs about looking for many different answers to very big questions. [MUSIC PLAYING] FRED RASCOE: Today's show is about the Distant Reader, a text analysis tool developed by our guest, Eric Lease Morgan, of the Notre Dame library. Let's return to our conversation with him. 
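The word tabulation and concordance search that Eric describes in the first interview segment can be sketched in a few lines of Python. This is only an illustration of the general idea, not the Distant Reader's actual code: the function names, the naive sentence splitting, and the two-document corpus are all assumptions invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(documents):
    """Count every word across a corpus given as {name: text}."""
    counts = Counter()
    for text in documents.values():
        counts.update(tokenize(text))
    return counts


def concordance(documents, phrase):
    """Return (document name, sentence) pairs for sentences containing the phrase."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    hits = []
    for name, text in documents.items():
        # Naive sentence split: break after ., !, or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if pattern.search(sentence):
                hits.append((name, sentence.strip()))
    return hits


# Hypothetical two-document corpus, made up purely for illustration.
corpus = {
    "essay_a.txt": "Beauty is truth. Nature is vast and old.",
    "essay_b.txt": "Some say beauty is in nature! Others disagree.",
}

for name, sentence in concordance(corpus, "beauty is"):
    print(name, "->", sentence)
print(word_counts(corpus).most_common(3))
```

Run against a real pile of books, the same search for "beauty is" would give the kind of long list of sentences Eric mentions. The part-of-speech tagging he goes on to describe needs a trained model such as spaCy's, which this sketch leaves out.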
So you mentioned that the advantage of this is the ability to go to a much larger scale. ERIC LEASE MORGAN: Yes. FRED RASCOE: And that is, it seems to me, in the realm of-- I don't know if it's exactly artificial intelligence, but it seems like it's in the realm of artificial intelligence. Because it's allowing humans-- it's doing something that humans could not possibly do on their own at the scale, that a single human could do. Do you consider this as a tool of artificial intelligence? ERIC LEASE MORGAN: The short answer, I think is no. It's not an artificial intelligence thing. But I've been writing software for 40-ish years. And early in my career, I did try to write artificial intelligence things, and they were called expert systems. Nowadays, artificial intelligence has become a new buzzword. And it came back. And a lot of the things that are called artificial intelligence are really machine learning, a technique called machine learning. And machine learning is a process of either classifying things or clustering things. And actually, the Reader does use some machine learning techniques. But I did not write those particular techniques. Those were written by other things. Allow me to elaborate. To determine whether or not a word is a noun, a verb, or an adjective, that's a classification technique. I have four or five things that the word must fall into, noun, verb, adjective, pronoun, adverb, whatever. And then I have a classification tool that says, here's a word. Which one of these things is it in? Is it a noun, a verb, an adjective? That is a machine learning tool. And the Distant Reader uses the machine learning model created by these folks called spaCy. spacy.io. And they have created a machine learning model that classifies text for various languages into parts of speech. And they have also used this same model technique to determine what is a named entity, a person, a place, a dollar amount. And I use that model. And it is a machine learning technique. 
But it's not really, in my opinion, artificial intelligence. And so the Reader, I don't really think is an artificial intelligence thing. Instead, it's like-- instead of using a trowel to, or a shovel to plow your cornfield, which would take you a long time. [LAUGHS] I'm going to use a plow. Instead of a shovel, I'm going to use a plow. And instead of my plow with my horse, I'm going to use a tractor. I just have a different bigger tool to accomplish my task, and it is not necessarily-- this particular thing is not necessarily an artificial intelligence thing. I wouldn't classify it that, no. But it does use machine learning as a part of its process. AMANDA PELLERIN: How do you balance open access and scholarly independence with concerns about who uses the tool and for what? ERIC LEASE MORGAN: Oh, I'm glad you asked that question. The Reader is agnostic when it comes to whether or not it's open access or under some sort of copyright or under some sort of-- under some sort of licensing thing. It is possible to feed the reader any kind of narrative text. The form can be a PDF document, a Word document, a CSV file, a text file, an HTML page, an RSS feed, tweets, Facebook postings, blah, blah, blah, blah, blah. The Reader doesn't know about that. It ingests the content. It does its best to convert everything to plain text. It then does its natural language processing and text mining techniques against it. And then it outputs a thing that I call a study carrel. Those little boxes in a library where you bring all your books together and they're all in your little order and then you study them. It doesn't matter what kind of content it is. But one of the important characteristics of a Distant Reader study carrel is that you get all the data back. You get all the analysis. You get all the tab-delimited text files. You get all of the relational database. You get the text that has been converted from the PDF documents. You get that back. 
And in your study carrel, you also get a copy of the original content. And all of this stuff is a standalone directory. And you can use the content in that directory on your computer, or it could be on a website. It is up to the person not to redistribute copyrighted materials. I'm a student. I go to JSTOR. Download, download, download, download, download, download, download, download, download. Feed the result to the Reader. The Reader gives you back a study carrel. You-- not the Reader-- you would be committing a crime if you redistributed your study carrel, because it would include that copyrighted material. FRED RASCOE: So I can see some light bulbs going off in some minds that perhaps have a lesser regard for copyright than others. I could see, for instance, there's the famous site, Sci-Hub, which has the entire-- almost the entire corpus of almost all scholarly literature, from at least the past 10, 20 years, and much farther back in many cases. ERIC LEASE MORGAN: They have a very large collection of scholarly and licensed material. FRED RASCOE: Yes, and I can just see-- of course, there's the argument that it's against copyright to have all this. But I can just imagine the possibilities of putting the Distant Reader to the entire scholarly corpus. ERIC LEASE MORGAN: Yes. Well, the Distant Reader would probably break with that amount of content. It's really designed more for something that's much more realistic. It's not really intended to read-- while I have used it to read the entire encyclopedia-- older versions of the entire Encyclopedia Britannica, it's really designed to help-- again, the largest size of the corpus, it's really intended to do is kind of sort of a PhD bibliography. That's kind of sort of a scale. That is reasonable. It's not that so much that it will function or break. But I think that there's too much nuance in anything larger, that you wouldn't get very meaningful stuff out of. 
I would, instead, what I would suggest is that a student or a scholar or a researcher might identify a subset of Sci-Hub, and then feed that to the Distant Reader for computing. And I think the results will be much more useful and meaningful to that particular person. Because that person will have added context. They will have, this is what I'm interested in. I'm not interested in the world's knowledge. I'm interested in what is beauty. AMANDA PELLERIN: You are listening to Lost in the Stacks, and we'll be back with more about the Distant Reader tool for digital text analysis on the left side of the hour. [MUSIC PLAYING] (SINGING) Two, one, two, three. SPEAKER: Hi, I'm Joelle Dietrich, and I'm an artist. I make work about moving around. You are listening to Lost in the Stacks on WREK Atlanta. (SINGING) When you were a child, you were touched by the muse, and she said you're on fire from your head to your shoes. CHARLIE BENNETT: Today's Lost in the Stacks is called The Distant Reader, subtitle, How to Read Lots and Lots and Lots of Books. The tool is available to all online at DistantReader.org. FRED RASCOE: Hey, you know, Charlie, I actually tried it out. CHARLIE BENNETT: Hey. I'm glad you did your research. How did it go? FRED RASCOE: Well, I decided to test this tool on some public domain texts that I got from HathiTrust. So what I did was I downloaded the complete papers of O.C. Marsh. That is a prominent 19th century paleontologist. Anyway, it was six volumes, covered a total of 3,000 PDF pages. And I uploaded all of those pages into the Distant Reader. CHARLIE BENNETT: Fred, I don't think there's a more on brand activity you could have engaged in. What did you find? FRED RASCOE: OK, so the Distant Reader told me that the most common noun in those papers was O. The letter O. CHARLIE BENNETT: Fred, I know you know this, but O is not a noun. FRED RASCOE: Yeah, I think it's down to some OCR errors in the PDF. Oh, OCR errors. My puns are running rings around you. 
CHARLIE BENNETT: Just stop, please. Just keep-- FRED RASCOE: All right, OK. OK. So the next most common nouns in the report were bone, species, tooth, and size. And all those make sense to me. Those words you'd associate with fossils. And as far as verbs, be-- the word, not the letter, is the most common verb. But there are also a lot of synonyms for discovery and description in the top 10. Like find, show, represent, describe, indicate. CHARLIE BENNETT: This is all very sciency. FRED RASCOE: Yeah, it makes sense. Also, the phrase teeth are very-- that three-word phrase-- "teeth are very" is the most frequently occurring positive assertion in this corpus, happens 13 times. And the most frequently occurring negative assertion is "bone has no pit." CHARLIE BENNETT: OK. Well, I think "teeth are very" and "bone has no pit" are the two sides of the most recent Godspeed You! Black Emperor record. But what do you think you got out of all of this stuff? FRED RASCOE: I think it means we need to do another show about fossils pretty soon. CHARLIE BENNETT: Remember the thing I said about on brand? FRED RASCOE: Oh. CHARLIE BENNETT: File this set under QA76.9.D343K66. [NON-ENGLISH] [MUSIC PLAYING] AMEET DOSHI: Just heard "You Got The Look" by Prince. And "Transmitting Live from Mars" by De La Soul. [NON-ENGLISH] Those were songs about unknown source input, using tools to uncover its meaning, and the resulting realizations. [NON-ENGLISH] [MUSIC PLAYING] AMANDA PELLERIN: Welcome back to Lost in the Stacks. We return to our interview with Eric Lease Morgan about his Distant Reader text analysis tool. And in this segment, we steered the conversation towards preservation. Well, I'm an archivist, Eric. ERIC LEASE MORGAN: Cool! AMANDA PELLERIN: So when you mentioned archives, I was like, yay! ERIC LEASE MORGAN: This isn't archiving. You will preserve the content. You will have the content. It's not supposed to be out there on the net somewhere. The net is going to break. 
Your URL is going to go away. If you create a study carrel, you have the thing and you have all the data. And it is-- so it is kind of preservation. AMANDA PELLERIN: Yes, that's where I'm going. Good segue. Do you have an archive for the data? How do you maintain and preserve the data that the tool produces? And will that be openly accessible? Now, are you saying then it's on the researcher, the individual researcher, to preserve their own study carrel, or does that kind of-- will that live on the Distant Reader site? If you have some examples of what's been run in the past on your site, but I'm assuming that's not everything, and yeah. ERIC LEASE MORGAN: That's a fun question. I could answer that. But let me try to be succinct. The idea is I have to read a lot. I'm going to feed it to the Reader. The Reader is going to give me a study carrel. The study carrel is a dot zip file. You can then download the dot zip file and you have 100% of the data. If your content was open access, you could put that zip file and the unzipped version on a website, and other people could read your carrel too. Or you could keep the study carrel on your computer and do your analysis that way. You could have multiple study carrels. And you could have a study carrel, one about truth, one about beauty, one about justice. And then you could compare and contrast different content and different study carrels. I am in the process of creating a library of study carrels. And each study carrel is a curated collection of study carrels, a library. And I will create study carrels, and I will create a library. And then those libraries will be-- they will have content, and there's no restrictions. They are available to anybody for any reason, because the content is not necessarily licensed. I haven't finished that library yet. It's all very fledgling sort of a workaround. The limitation is the computing. 
The resulting study carrels may be-- I create study carrels that are about 14 gigabytes in size. That's large. And to create these study carrels almost requires what's called a high performance computing system. That's a computer with many computers. And so I have access to about-- I have four accounts-- no, four accounts on three different supercomputing systems that I'm working this on. And I will create a library. And I'm kind of being called a supercomputer couch surfer. Because I move from supercomputer system to supercomputer system to supercomputer system. And once I get something that's kind of stable, I'll be able to have a library that lasts a long time. I do want to create a library. And the library will contain things like the totality of this open access journal, all the works by Emerson, female authors from the 18th century. All these articles have to do-- these 50,000 articles have to do with COVID-19. So I'll create a library of different types of things. And the idea would be, oh, this is interesting. Oh, I can download this. Oh, I can use this. Oh, I would like one for my content. And then I'll provide a way for them to do that. That's the big plan. AMANDA PELLERIN: We've been speaking today with Eric Lease Morgan, Digital Initiatives Librarian at Notre Dame and creator of the Distant Reader online tool. He writes software in libraries. Thanks so much for joining us today, Eric. It's been a pleasure to talk to you and learn more about the Distant Reader tool. ERIC LEASE MORGAN: Thank you very much. It's been a pleasure and honor to be with you. [MUSIC PLAYING] FRED RASCOE: OK, wait a second. That was supposed to be the end of the interview, but Eric wanted to add a clarification. We had been talking about the Distant Reader as a program, but that wasn't quite right. So we'll let him explain it. ERIC LEASE MORGAN: There's one more thing I'd like to clarify. Technology-wise. This is a high performance computing system. It's not a program. 
It's not a program. It's actually like two dozen different programs that are doing its thing. And in my mind, the software that I write is akin to poetry. It is intended to be read. It's like each program is really an instrument in a symphony, and they're making music, which is kind of fun. FRED RASCOE: Computer code as poetry. That sounds like yet another show idea. OK, now we can play the bumper music. [MUSIC PLAYING] File this set under PA4025.A5. [THE SERVANTS, "COMPLETE WORKS"] [MOWBIRD, "HOLY MOLY ME OH MY"] CHARLIE BENNETT: That was "Holy Moly Me Oh My" by Mowbird. And before that "Complete Works" by The Servants. And we started that set with "One Chapter In The Book" by The Minutemen. Those were songs about diving further into texts and making discoveries. [MUSIC PLAYING] Today's show was about the Distant Reader, an online text analysis tool created by Eric Lease Morgan of Notre Dame. AMANDA PELLERIN: Hey, Charlie. You know, Eric had one more thing to add after the interview was over. CHARLIE BENNETT: You have another clip from Eric? How much extra audio did you all have stashed away? Should this have been like two shows or its own series? AMANDA PELLERIN: Maybe, possibly. He was pretty interesting. But while the mic was still on and we were saying our goodbyes, I asked him if he thought the Distant Reader's automatic text processing and reporting tool was in any way similar to Cliff Notes. And we just had to share what he had to say. ERIC LEASE MORGAN: Cliff Notes is a type of reading. Cliff Notes is not wrong. Now, if you just use the Cliff Notes, you're doing yourself an injustice. If you just use the Distant Reader, you're doing yourself an injustice. The Cliff Notes is a different type of reading. Here's a funny story, OK? It's kind of weird. But a long time ago, there were these two friends. And one of the friends, his wife got kidnapped, and they took her away to a far away land. And the friend says, help me go back, get my wife back. 
And he goes, OK. And they go over there. And they siege this place for 10 years. And they finally don't win the battle and get the wife back until they make this horse. And this is a really cool story. And this dude comes along and he says, I'm going to write it down. And people go, no, you can't do that. Why? Well, because I'll be out of a job, for one thing. And you can't write that sort of stuff down. He goes, no, I'm going to write it down. And then later on, there's people and they go, I'm going to translate this cool story from Greek into Roman. Latin. No, you can't do that. Why? Because things will get lost in translation. Oh, I'm going to do it anyway. Time goes by. I'm going to put it into a book instead of a scroll. Well, you can't do that. Why? Because you have to read from beginning to end. You can't just jump to the end of the book. That's not fair. Time goes by, I'm going to put this thing up a gazillion times. No, you can't do that. Why? Because you have to be special. You can't just give it away to everybody. And time goes by. And people say, I'm going to put it on a computer. No, you can't do that. Why? Because you have to have the pages and the smell, and I can't write in the margins. But look at all this extra cool stuff that I can do with it. The idea is that the technology changes, but the story kind of sort of remains the same. And Cliff Notes is an example. Cliff Notes was a type of technology that was applied against the book. No, you can't do that. No, yes, I can. And the same thing with this Distant Reader. This is a different type of reading, a different way of telling and communicating the same old story. The woman is still getting kidnapped. [LAUGHS] They still have coming of age stories. They still talk about love. They still talk about whether or not gods are pushing them around or not. It's just a different way of telling the same sort of thing. So the Cliff Notes is not a replacement for the reading. It's a supplement. 
The same thing with the Distant Reader. CHARLIE BENNETT: I love an extra clip. Let's roll the credits. [MUSIC PLAYING] AMEET DOSHI: Lost in the Stacks is a collaboration between WREK Atlanta and the Georgia Tech Library, written and produced by Ameet Doshi, Amanda Pellerin, Charlie Bennett, Fred Rascoe, Marlee Givens, and Wendy Hagenmaier. Today's show was edited and assembled by Fred, who spent a lot of time this week playing on DistantReader.org instead of doing, you know, your job, Fred. Sorry. AMANDA PELLERIN: Legal counsel and text analysis reports of viticulture and enology handbooks were provided by the Burrus Intellectual Property Law Group in Atlanta, Georgia. Did I say that right? FRED RASCOE: I don't think I could have pronounced those words right myself. Special thanks to Eric for being on the show and for developing such a cool tool. And thanks, as always, to each and every one of you for listening. AMEET DOSHI: You can find us online at lostinthestacks.org, or library.gatech.edu/lostinthestacks. And you can subscribe to our podcast pretty much anywhere you get your audio fix. AMANDA PELLERIN: Next week is a rerun, and we'll be back with a new show the week after that. CHARLIE BENNETT: It's time for our last song today. That is, unless Fred and Amanda have any more interview clips. FRED RASCOE: No. AMANDA PELLERIN: I think we're good. AMEET DOSHI: All right. Well, we'll definitely have to get Eric back on the show. But until then, I say we end with yet another way to retell the story of Helen and the Trojan War, with rock and roll. This is Helen of Troy by the Git Gone Boys, right here on Lost in the Stacks. Have a great weekend, everyone. Get gone. [MUSIC PLAYING] (SINGING) Oh, there she goes, strutting. There she goes. And a face that launched a thousand ships.