Okay, welcome everyone. All good with audio? Yeah. Wow. It's been over 18 months since we had a colloquium in person with a regular audience, and it's a real treat that our first speaker to get us off in the new normal is Alfred Spector. Alfred insisted that I keep the introduction as short as possible, but it turns out he's had accomplishments along three dimensions, each of which requires its own introduction. So, academia; we all know academia, and Alfred excelled in academia. He was a tenured professor at CMU. He is one of the OGs of distributed computing and was duly recognized for all his contributions there, including being named a Fellow of the ACM, the IEEE, and the AAAS, a member of the National Academy of Engineering, and receiving the Tsutomu Kanai Award from the IEEE Computer Society. Very briefly, the other dimension: entrepreneurship. From CMU, Alfred was CEO and founder of Transarc Corporation, which was built on his deep expertise in distributed computing and was later acquired by IBM, which is when I first crossed paths with Alfred; I was at IBM Research at that time. So he had a very successful entrepreneurship experience. And the third dimension has been amazing executive leadership. First at IBM, as VP of Research for software and services; I was part of his division at that time. Lots of great stories, but that's all you're going to hear about the great stories. And then later on in our product group; we have lots of interesting stories to share there as well, as I said, but I won't get into those. After that, he was at Google as VP of Research and Special Initiatives, where he made a number of ground-breaking changes in Google's approach to research. And most recently he was CTO and Head of Engineering at Two Sigma. Alfred has lots of opinions on many things dating back to his work in distributed computing, but today he's going to talk to us about the opportunities and perils of data science. So please join me in welcoming Alfred.

Is the microphone on? As long as our audio engineer says everything is okay, I will continue, and I see a thumbs up, which is a good sign. So thank you all for coming. It is exciting for me. I've given lectures only from behind a screen and camera for a year and a half, and I can tell you, I'm sure it's true for classes, that it's better to be in person, and it's so much better for lectures as well. So I'm delighted to have you as an audience, and thank you so much for coming. In this talk, what I hope to do is explain data science as a coherent discipline. It's a new term; it's only been around for 11 years, which is not to say we haven't been practicing aspects of it for a lot longer. But it's a new term, and it bears definition and exploration, particularly given its importance to society. So I'm going to go through a number of things in this talk. First, the purpose of the talk is to define the field. The second is to explain that, as with many technologies that are very powerful, there are enormous opportunities, but we can actually do a lot of harm, and it's actually quite challenging to do well. And third, I do want to be constructive; this is a university, we're all trying to learn and do better, so I hope to teach how to apply data science more effectively. Now, finally, I have one other goal: I'll be meeting with some of you after the lecture.
This is the result of a talk that I gave six years ago, what I call Version 3, at the INFORMS conference, but also of a year that I put into writing a book on the subject along with Peter Norvig, Chris Wiggins, and Jeannette Wing. And this is the first time I have tried to present the material in light of about a person-year of work organizing it. So thank you for being guinea pigs and listening to this. I'll define the field and discuss its historic roots; this won't take too long. I'll be discussing an analysis rubric: if you're faced with a data science problem, what is it that you need to do in order to use data science effectively? And then, what are the challenges associated with it? We'll be going through these in detail, so you don't need to madly scribble them down; these slides will be available, and if you stick to the end I'll give you the URL where the slides are posted. Now, if we switch from being computer scientists and statisticians and data scientists, and we think about society at large, I think you'd all agree there are concerns that society has about this field. You can't pick up the newspaper any day without seeing somebody concerned about the scale of a company or what's going on in some social network. I will discuss those and actually make a few recommendations; there are many more in the book. And then I'll try to summarize what I hope you'll get from it. So first, what is data science? What you often see is: it's the study of extracting value from data. And to be frank, I always found that a little simplistic, because that's what we do from the age of three, right? We walk around the room, we feel something and say, oh, that's hard. Data science is somehow more than that. In fact, I think there are two different kinds of activities that we do with data science. One is we try to gain insights from data, and two is we try to draw conclusions. An insight might be a new hypothesis, maybe some understanding we get by manipulating things and seeing how x varies as a function of y, or something of that form. Maybe if we're lucky we get an aha, like Sir Isaac Newton got when he was measuring balls he was dropping off a building, or something like that, at least apocryphally. The other thing we do, typically in the computer science domain but also in statistical domains, is we try to reach conclusions. We try to make a prediction: does someone want an ad? We try to make a recommendation: what video should they watch? We try to cluster like elements together, and then name them or classify them as a group. We try to transform them, maybe transform English to French, which is now a data science application. And we try to optimize: create an optimization of, say, traffic on a road system, or minimize delay, or something of that form. It's important to enumerate these kinds of conclusions, I believe, because so much of the field is about the techniques and the analyses and the discussion of how to achieve them. If you think about what you do in machine learning or in other disciplines in the field, how do we do these things? These are essential in data science. Now, the historical basis, in our view, is a long one, even though the term only came about around 2010. Statistics starts in the late 1700s. And I was very interested in this.
You can find the book online: Sir John Sinclair's Statistical Accounts of Scotland. You can find out from that book child mortality rates. Anyone want to guess the difference in child mortality rates in the United States today versus the Scotland of that era, for children under the age of, I think it was five or six, I forget the exact cutoff? Want to guess the ratio? Fifty? Okay. Anyone else have another number? Okay, that one's going to win as the best guess so far if no one else says something. Yes? If you add another zero, you're closer: a thousand. And 500 is pretty much equal to 1,000 for a computer scientist, so yes, it's a thousand times better. That's what you can learn; lots of interesting data. And then Breiman, who was a fantastic mathematician and statistician at Berkeley, began to realize that statisticians' methods weren't really applicable to this big-data, algorithmic set of approaches that computer scientists had, and began to move statistics toward the data science direction. He should be credited with that, as some of his disciples at Stanford and other places say. Operations research, which was very important to me because of IBM's focus in that area, started in the 1800s with Babbage of all people, who of course architected and almost created the first really mechanical computer. He was brilliant, and he was interested in optimizing rail transportation and postal delivery. So he was looking at optimizations; he's a forebear of operations research. And then in the late 1930s, as people were looking at very big military operations, you have these military planners thinking: how do you coordinate and optimize military actions with airplanes and people and ships and all these kinds of things? Computing, of course, and the College of Computing here: we know we have so much that contributes, all the storage and engineering, machine learning, visualization, et cetera. Our field had a lot of very interesting things happen. What many of you who are younger don't realize is that empiricism was not really a big part of computing in the era, say, of your professors or myself. In the early days we were algorithmic or engineering focused, and not so much data-centric in terms of learning from data. But things did come about. Turing wrote about it in 1950. Samuel, who was at IBM, did work on checkers playing that could learn. Moore's Law was of course a gigantic driver, because we could never use all the data in machine learning without this exponential growth. Salton, who was a professor at Cornell, wrote about information retrieval and how to learn from the habits of people retrieving information; it's really what led to Google in its earlier days, and in fact it was Salton's disciple Amit Singhal who led Google search for years. So we could go through all of these things, but there isn't really time to do it. I believe this is a list of the major activities as computing moved toward data and being data-driven, and I would call it broadly empiricism. And here's the original paper from Turing, where he writes: if I could just watch chess players playing against each other, maybe I could learn to play chess. He does write about that. So as much as I tend to think of him as the Turing of the Turing machine and the importance of that theoretical basis, he did some work in engineering as well.
He had some clever engineering advances, not as many as, say, von Neumann and others of the time, but some. And he did push on this learning aspect of things as well. So there's good reason to think of him as a founder of the field. Now, the other side... yes, question? Yes, sir. Okay. So the question is, why is this a science? It happens to be the name of the field, so I think you should view it as a noun phrase and not get into that concern. Now, are there hypotheses, and will we try to prove hypotheses? In fact, probably more so here than generically across the entire field of computer science. So I think the name is probably more deserved in data science than it would be in computer science. But I would basically not get lost in that and just consider it a noun phrase. Now, scale is important. This is what these data centers look like. They are really big, and we all recognized this would happen. The numbers are very large; there are very large scales in many of these things today. Even at the relatively small investment firm where I was CTO, we were capturing 10^15 bytes per month. So that firm must be getting toward an exabyte of storage by now, given scale and the interesting phenomenon of why you can never delete anything, et cetera. So there's an immense amount of data. And if you just think about the numbers involved, something like 60,000 searches per second are being done around the world in a given day, based upon reported numbers from Google. So these are vast abilities to learn things, and we'll get back to that in a moment. What I want to do now is turn toward characterizing how to do data science. The way that I like to think about that, being someone who learns from example, is to look at some representative examples of data science and induce from them a methodology for how to apply the field. In our book we do this with six examples; I won't have time to go through all of them now. Spelling, which you can understand; speech recognition, which we also understand; recommendation systems, clearly; and protein folding, which is a new one for us. It was really unclear whether protein folding would be cracked through our detailed understanding of physics, doing so-called ab initio physical modeling to derive the structure of a protein from all the physical interactions, quantum physics in fact, or whether we would be learning it from data. Well, at the moment it looks very much like data is winning, again. Another interesting one: many people have said, well, gee, if we could just have fantastic healthcare records, we would understand what's going on with disease. We could look retrospectively at vast amounts of data: what happens when people have taken aspirin, do they have a lower risk of this disease or that? It's a very interesting and important use, and a very, very difficult one, as it turns out. And then predicting COVID mortality; we'll look at that one, and at the first one, spelling, and at recommendation systems, in a little more detail. So, spelling has switched from dictionary-based approaches to machine learning and, sort of, information-retrieval based approaches. It's a big problem: 1% of words in documents, and 10 percent of Google queries, are typically misspelled. You'd like to classify misspellings; note my hand-drawn squiggle under "misspellings," which is spelled wrong on the slide.
That's a classification problem: to be able to underline it and say it's wrong. You'd like to make a recommendation of what the right word might be. Or you might like to just transform the word into the correct word, based upon your certainty. So those are examples of the types of conclusions you try to reach with data science. From the perspective of this application, are there any downsides? Well, Plato tells us Socrates thought perhaps writing was bad, because if you had to write things down, human memory would be deprecated in some way; we would lose the ability to memorize things. The great poet Homer presumably memorized the Odyssey and the Iliad; to the best of my knowledge they weren't written down at the time. So he had a good memory, and I don't think I could do that today; I can do the first line of it in Greek, and that's it. So maybe it's bad. I don't think the arguments that are made against spelling correction work; I don't think it's making us stupid, but we could debate it as we think about objectives. I'm pretty sure it's not; in fact, I think it teaches spelling, is my own guess. So: very few downsides, and many different approaches. One thing to think about is data. With, say, 6 billion searches per day, if you have one new use of a word per million queries, a system can get 6,000 training examples in a day. So when Barack Obama appeared on the scene in the United States, no one knew how to spell his name. A machine learning system, or the systems then in use (they weren't really machine learning at that time), probably learned to spell Barack Obama within a few hours, or could have. Music recommendations. It's an ensemble; like many recommendation systems, there are many things you can do at the same time to make it work. You can do collaborative filtering: if you like Beethoven and you also like Mozart, and I like Beethoven, I probably like Mozart as well. We can use semantic knowledge that can be learned: you can find co-references of Beethoven and Mozart in a lot of Wikipedia articles on the Classical period of classical music, and from that you could cluster them together and then understand something about recommending the two together. We can do audio signal processing; almost any form of audio signal processing is a transformation of music into some digital domain, and we can do comparisons and find out what's similar. One can build an ensemble of these techniques, and there are many more things that go into recommendation systems, for example optimization. At Amazon, as an example, you might imagine they're trying to sell things, right? You might pay extra to buy something, so maybe they want to make a recommendation that might make more money. If you're at a music service where you're not trying to sell something, maybe you want to provide something that's valuable to a user, since it doesn't cost you very much and you're selling it on a subscription basis. So there are many different forms of objectives and optimization that can occur as you end up building a recommendation system: all the types of conclusions that you need in data science. Now, indeed, if we switch from music recommendations to news recommendations, it gets a lot more complicated quickly, as I think you would all realize, and I leave it to you to think about how different recommendation systems require very different types of objectives. So those are all pretty good cases.
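To make the collaborative-filtering idea concrete, here is a minimal item-based sketch in Python. The listener-by-composer matrix, the names, and the scoring rule are made-up illustrations, not how any production recommender actually works; a real system would blend a signal like this with the semantic, audio, and optimization signals just described.

```python
# A minimal item-based collaborative-filtering sketch on a hypothetical toy matrix:
# if listeners who like Beethoven also tend to like Mozart, recommend Mozart
# to a new Beethoven fan.
import numpy as np

# Rows = listeners, columns = composers; 1 = listened/liked, 0 = not.
composers = ["Beethoven", "Mozart", "Coltrane", "Beyonce"]
ratings = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
], dtype=float)

def cosine_sim(a, b):
    # Cosine similarity between two item (column) vectors.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def recommend(user_vector, k=1):
    # Score each unheard item by its similarity to the items the user already likes.
    liked = [i for i, v in enumerate(user_vector) if v]
    scores = {}
    for j, name in enumerate(composers):
        if user_vector[j]:
            continue  # already likes this one
        scores[name] = sum(cosine_sim(ratings[:, j], ratings[:, i]) for i in liked)
    return sorted(scores, key=scores.get, reverse=True)[:k]

new_listener = np.array([1, 0, 0, 0], dtype=float)  # likes only Beethoven so far
print(recommend(new_listener))  # ['Mozart'] on this toy matrix
```

On this toy data, a listener who likes only Beethoven gets Mozart recommended, which is exactly the "people who like Beethoven also like Mozart" intuition from the talk.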
I mean, we're getting a little bit more complicated with news, for sure. COVID mortality prediction hasn't worked very well; we haven't been able to do it well. It would be really valuable to be able to predict mortality rates over time, because we could guide policy and individual behavior. But it really hasn't worked that well, despite a huge history; we've been thinking about this since smallpox in the 1700s. There are vast numbers of epidemiologists who look at various kinds of informed models, the so-called SIR models, which track how many people are susceptible, how many people have been exposed, who is contagious, and who has recovered. You could imagine building compartmental models with differential equations and predicting expected infection rates, et cetera. And yet they haven't worked; all of us have learned humility about modeling from this. There was a very good paper published toward the beginning of this year that compared epidemiological models retrospectively, and more or less none of them had any predictive ability beyond a few weeks. Very few of them performed better than a sort of naive model, and frankly, the relatively naive ones worked best. All the data in the world that we had didn't make it work. My own guess as to why they don't work very well is that, relative to many other problems, we actually have very little data for doing this. Here's an example of why it's complicated; this is the use of data science to gain insight. These dots in different colors represent, at different time periods, the mortality rate as a function of the vaccination rate in the 50 states and the District of Columbia. On July 11th, in blue, you'll see that there weren't very many COVID deaths anywhere, despite quite different vaccination rates, ranging from something like 65% down to maybe 30 percent. And as the Delta wave increased, going from green to orange to dark red (skip the brown for a moment; I'll explain the brown line in a minute), you see that the slope increased: it became more important to have a vaccinated population. That doesn't seem surprising when there was more prevalent disease. Now, as the wave is decreasing around the United States, the most recent data, which I added yesterday for this talk, is the brown line, and you see that the slope of the curve is decreasing again. So the notion here is that with a lot of data, you can begin to gain insight, and there are a lot of things you can learn from this beyond what I've just shown; you could spend a lot of time with it and gain a lot of insight. A very interesting question is whether the slope of the curve will change and start increasing again at some point; it may do that. We'll see; before the book is finally printed, I'll know the answer. There are lots of areas of applicability, as you see, and I don't know the extent to which you see them yet; it depends on where you are in your degree and how much you've been thinking about it. But there isn't a domain where we don't think we can plausibly apply data science these days. This is just a small set, and I could probably write down a hundred more, and you could write down a hundred more, in a short period of time.
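As an aside on the compartmental models mentioned a moment ago, here is a minimal SIR sketch that simply integrates the three differential equations forward in time. The parameter values are illustrative assumptions, not fitted to any real outbreak; real epidemiological models add more compartments (exposed, vaccinated, and so on) and must be fit to noisy, incomplete data, which is where the trouble described above begins.

```python
# A minimal SIR compartmental model: S -> I -> R, integrated with simple Euler steps.
# dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I
# beta and gamma below are illustrative, not estimates for any real disease.
import numpy as np

def simulate_sir(beta=0.3, gamma=0.1, s0=0.999, i0=0.001, days=160, dt=1.0):
    s, i, r = s0, i0, 0.0
    history = []
    for day in np.arange(0, days, dt):
        new_infections = beta * s * i * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((day, s, i, r))
    return history

trajectory = simulate_sir()
peak_day, _, peak_i, _ = max(trajectory, key=lambda row: row[2])
print(f"Peak infected fraction {peak_i:.2%} around day {peak_day:.0f}")
```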
So with that as background, how do we think about this? What you might believe, as the best mathematical, computer-y, software-y, statistics-y, operations-research-y kind of people, is that it's all about the latest neural network architecture, and all about the best statistics and such. That's an important piece of it, but it's a much broader topic to do data science well, and I would argue that to do it, you need to think in terms of the seven elements of this rubric. I'll go through them in some detail as we consider challenges. The first is, obviously: do you have the data, and is it available to you in your application? That's not so easy to answer for a lot of purposes. For example, with COVID prediction, if we knew the location of everyone at all times and who they were near, we could probably do a much better job predicting COVID mortality, and certainly infection. But consider the difficulty, for privacy reasons, of gathering that data; we can't gather it in this pluralistic society. In Australia perhaps, and in China, you can gather it, and you can't gather it here. I would urge you to consider the pros and cons of this privacy focus that we have; I say that neutrally. It's a very interesting set of issues. Second, a technical approach. Is there an approach to solving the problem, one that might work? Do you know what you might be able to do? Is there a machine learning approach or something? That isn't always the case; for example, if distributions are non-stationary and things change very frequently, it can be very difficult to apply data science, as in these sort of game-theoretic situations. Third, and this is a huge amount of the effort: can you build the application dependably? What I mean by that: "dependable" is a defined word here, and the question can be asked whether it's the right word, but it's a defined term. By it I mean privacy, security, abuse resistance, and resilience. There are many applications you can conceive of building that will not work if you consider these four topics. I can give you lots of examples from when I was at Google where there were things we knew we could not do because we could not achieve these; we didn't try. Fourth, now on the requirements side: will you generate sufficient understanding from the conclusions you've reached, or the analyses you do, to allow people to accept what you're doing? Can you explain your result? That's not always needed, but it's sometimes needed. Can you provide causality, and the word causality should be there, not just understanding: can you show that x causes y, not just that x is associated with y? And finally, if you're in the scientific domain, can you provide enough information so that others can reproduce what you've done, in the grand tradition of scientific reproducibility? Obviously, the FDA today will not approve drugs or vaccines without reproducibility of the data. Fifth are clear objectives. Do we know the objectives we're trying to achieve? That may seem simple. If you're going to go do a startup, you might say, sure, I know what we want to do. And I bet that if I pressure you and challenge you, you'll find it's a lot more challenging and detail-oriented than you thought to hone the objectives to be exactly what they should be, objectives that will survive for more than, you know, your first alpha release of the product. And then sixth: is the application tolerant of failures?
And that's an interesting question. And then finally, and I think you may have seen this on the first chart: can we do this in a way that's ethical, legal, and respectful of the societal impacts that occur? You might say, well, why bother; as long as it's legal, it's okay. The answer is that, in the long term, it will be good if it's ethical, and it will be good if society accepts what you're doing; otherwise, it won't work. So these are the seven rubric elements, and they need to be considered. I'll go through them, because each one of them has challenges. In spelling, they're easy. Everything works in spelling, in my argument, unless you think somehow or other that spelling correction makes you stupid, in which case perhaps this is not such an obvious case. But all these things are pretty good; it seems like a nice neutral application, and that's why we use it first. Yes? So, what do I mean by the definition of "correct"? It's majority-driven, right? The majority of people spell "the" a certain way; it could be that the majority spells something incorrectly, and that's not aligned with what's actually right. Or take it further: what if 99% of people spelled "the" wrong? So the question is, to what extent do you learn from interaction versus the existing corpus? As I mentioned, the existing corpus certainly is part of the data you learn from, as are the interactions. And the interactions do bring up an issue: there's the potential for abuse, right? Could all of us get together in a consortium to convince a spelling corrector at Bing or Google to spell "the" incorrectly? Possibly; fortunately there's not much motivation to do that. But the answer is you have to use both, and you must be very careful. This is one of the great risks. The reason that search engines work quite well and have been relatively immune to this is that they use PageRank and a lot of corpus-derived signals, not just the immediacy of what people are typing and clicking on. So there's a balance, and I think you're right; it's a very good example of why things are not so easy to do right. So this is a very easy case, and you can see how others get harder, and that's all I can really spend time on for this one. So here are the challenges. With data, you might say: easy. But if you haven't heard the term data wrangling, that's the term that's used. The data operations of companies or organizations that rely on data science are huge; I would guess that in many of the places I've seen, 10 to 20% of the work is on getting data. We could go into vast numbers of challenges associated with it, but let's say we're doing retrospective healthcare studies. What probably happens is that data is encoded in one institution in a slightly different way than it's encoded in another institution, which means that pulling the data together to generate a statistical average, or any form of conclusion, requires transforming the data (a tiny harmonization sketch follows in a moment). One of the reasons Israel was so successful in getting good data on COVID vaccinations is that they have four primary healthcare providers, each serving about two million people, plus or minus, and they had coherent data within each group. In America, we'd have all these different hospitals getting together, trying to pool data.
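Here is a tiny, hypothetical illustration of that wrangling burden: two made-up hospital extracts encode the same facts with different column names, diagnosis codes, and units, so even a simple pooled average requires mapping everything onto one schema first. The tables, codes, and mapping are invented for illustration.

```python
# Two hypothetical hospital extracts describing the same kinds of facts differently.
import pandas as pd

hospital_a = pd.DataFrame({
    "patient": ["a1", "a2"],
    "dx_code": ["I21", "E11"],          # ICD-10-style codes
    "weight_kg": [82.0, 70.5],
})
hospital_b = pd.DataFrame({
    "id": ["b7", "b9"],
    "diagnosis": ["heart attack", "type 2 diabetes"],  # free text
    "weight_lb": [190.0, 141.0],
})

# Hand-built mapping from free text to codes; in practice such mappings are
# large, incomplete, and a major source of error.
text_to_code = {"heart attack": "I21", "type 2 diabetes": "E11"}

harmonized_b = pd.DataFrame({
    "patient": hospital_b["id"],
    "dx_code": hospital_b["diagnosis"].map(text_to_code),
    "weight_kg": hospital_b["weight_lb"] * 0.4536,      # pounds -> kilograms
})

combined = pd.concat([hospital_a, harmonized_b], ignore_index=True)
print(combined.groupby("dx_code")["weight_kg"].mean())  # only now is an average meaningful
```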
Not only that, pulling data across institutions runs into vast privacy and regulatory issues that make it really hard to do. So I give you that as one example, and I'll give you one more. Imagine you're an algorithmic investment firm and you want to make split-second decisions, based upon incoming data, on whether to buy or sell. And imagine all of a sudden you get a data element of a stock price which is really low. What does that mean? Is it a bad data element? Or is this the chance you've been waiting for, so jump in quickly with your high-performance system? What do you do? Many companies could go under by making that interpretation incorrectly. And I could go on forever. If you want data, it should be kept reliably. But if you keep it reliably, you keep it in multiple places; and if you keep it in multiple places, how do you delete it? So, all sorts of issues. On technical approach, many of you know far more than I do, because you've taken a course in this more recently than I have. There are all sorts of issues, but I think it's still clear that we don't really understand how machine learning works in many cases, or why it works as well as it does, as suggested by this quote from Ali Rahimi. I've mentioned non-stationarity of distributions and the risks of overfitting. Scale in these systems is actually very large; it's a significant power use to run models like BERT over all the natural language in the world. So these are significant issues. And if you want to produce answers with machine learning, oftentimes you can get means, but you can't get distributions of results. We care about distributions because people care about risk. Again, go back to a financial institution: if I tell you that you could make a lot of money on this investment, but 80 percent of the time it'll be down a lot, you won't sleep very well, because why would you trust that? You want to know that things are sort of bounded. And the list goes on; there are many challenges in technical approach. When I first gave this talk in 2015, I was very pleased because, just at exactly the right time, adversarial-examples research had come out and showed that a lot of machine learning algorithms of the time would recognize things like black and orange stripes as a school bus. And not only would they say, I think it's a school bus; they'd say, I am really confident this is a school bus. That has persisted; there are no complete solutions to this issue even today. So it gives you a very specific, but at least colorful, example.
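Here is a toy version of that adversarial-example phenomenon, using a synthetic logistic-regression model rather than the image classifiers in the original work: nudge every feature of an input in the direction that hurts the model most (the gradient-sign idea), and the prediction flips while the model leans confidently the other way. All data, parameters, and the choice of nudge size are illustrative assumptions.

```python
# Adversarial perturbation of a toy logistic-regression classifier on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
d = 200
X = np.vstack([rng.normal(-0.2, 1.0, (200, d)),   # class 0 blob
               rng.normal(+0.2, 1.0, (200, d))])  # class 1 blob
y = np.array([0] * 200 + [1] * 200)

# Fit plain logistic regression by gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    logits = np.clip(X @ w + b, -30, 30)          # clip for numerical safety
    p = 1 / (1 + np.exp(-logits))
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)

def prob_class1(x):
    return 1 / (1 + np.exp(-(x @ w + b)))

x = X[300]                               # a genuine class-1 training example
logit = x @ w + b
eps = (logit + 3.0) / np.abs(w).sum()    # per-feature nudge just big enough to flip it
x_adv = x - eps * np.sign(w)             # move each feature against the class-1 direction

print(f"per-feature nudge eps = {eps:.3f} (features have std 1)")
print(f"original:  P(class 1) = {prob_class1(x):.3f}")
print(f"perturbed: P(class 1) = {prob_class1(x_adv):.3f}")   # now confidently class 0
```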
Now, on dependability, there are so many issues that you'd need a lecture series unto itself, maybe four of them, to cover these topics. When I think about privacy, more as a practitioner than as a privacy expert, there are five things people are really concerned about. They're concerned about the collection of data: the mere collection of it by someone is worrisome. They're worried about its storage, and in fact that's often regulated: for what periods of time, how, et cetera. They're worried, beyond those things, about whether it will be maintained confidentially; I might be willing to say to someone I trust, you can record this and you can store it, but don't tell anyone. That's confidentiality. Then there's the question of whether the data can be used for me in whatever way some system wants: is a system, a computer, allowed to use the data to, say, fix spelling for me or make advertising recommendations to me? Interesting question, right? It's perplexing to people: on the one hand, they want recommendations that are good; on the other hand, sometimes they feel they're spooky and don't like them. So there's usage for self, and then there's the question of usage for others, which of course, if you're maintaining confidentiality, has to be done in a confidentiality-preserving way. So our thinking about this is: how do you regulate, define, and manage these five topics? Now, in security, there are all manner of issues. Security is the hardest problem in computer science, in my opinion. I say this with a security expert in the room: we're going to solve artificial intelligence before we solve computer security; that's how hard it is. We'll have AGI first, and we may never solve computer security. But you can't go after certain problems in data science while recognizing that the systems are going to be insecure, or you're going to do people great harm. Can you make systems resilient? As we optimize the heck out of everything, will they still work, and will they work with enough confidence that people won't suffer failures because of it? And then, last but not least, recognize that when we put many systems out in the wild, people will abuse them. They will do things like teach misspellings to the system, or worse, get a political candidate elected, and the like. So can you handle these issues? Explanation we've covered, so I won't spend too long on it. But can a system say why? Imagine that you're in front of an automated parole system, and the system says you're not eligible for parole, and you ask why, and it delivers you a bunch of coefficients and hyperparameters and such things from a neural network. Not very good. And it has real practical implications if a system doesn't work right. Let's say you build a search engine that depends on machine learning and it stops working: how do you debug it? What do you do? Do you retrain the entire system? What data do you omit in the retraining, since you can't even know why it failed? It's also very difficult to prove causality from experiments done just on existing data. And if there's any big issue that I would beat the drum on, it's to remind everyone how easy it is to let associations convince you that x causes y. Unless you can really prove it, I will not believe you, and you should not believe yourself. It is too easy to get caught in traps here. The best example is the use of estrogen replacement therapy for women. We did a retrospective study and found that women who took estrogen post-menopausally had a seemingly lower risk of heart disease. And this made sense, right? We thought we could understand the causal reason, because women who go through menopause have lower estrogen. It turned out to be self-selection bias: the women in the study were getting better medical care, probably for other reasons. So it wasn't really true, and we stopped doing that. But I would almost challenge any of us: would we not have reached the same conclusion from that retrospective study, and have been wrong? So we must be cautious, and tell others to be as well; a toy illustration of this trap follows.
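Here is a toy simulation of that self-selection trap, with invented numbers: a hidden factor (quality of care) makes people both likelier to receive the therapy and independently lower-risk, so the therapy looks protective in the raw comparison even though, by construction, it does nothing. Stratifying on the hidden factor recovers the truth, but only because the simulation lets us observe it.

```python
# Confounding by self-selection: the therapy has zero effect by construction,
# yet the naive comparison makes it look protective.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
good_care = rng.random(n) < 0.5                        # hidden confounder
treated = rng.random(n) < np.where(good_care, 0.7, 0.2)  # good care -> likelier to be treated
risk = np.where(good_care, 0.05, 0.15)                 # risk depends only on care, not therapy
event = rng.random(n) < risk

naive = event[treated].mean() - event[~treated].mean()
print(f"naive treated-vs-untreated risk difference: {naive:+.3f}")  # looks protective

# Stratifying on the confounder shows the true effect is about zero in each stratum.
for stratum in (True, False):
    m = good_care == stratum
    diff = event[m & treated].mean() - event[m & ~treated].mean()
    print(f"good_care={stratum}: risk difference {diff:+.3f}")
```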
And then reproducibility: how do you release the data? If you're an Israeli study on COVID and you've gotten all this data, can you release enough of it to convince the FDA that you're right? It's very difficult, for confidentiality reasons. Aggregation: Facebook tried to release a lot of data, and you saw what that got them in the Cambridge Analytica story; they were actually sort of trying to be nice in that instance, and it was a big, big mistake. So consider the role of data scientists in this, and how careful we need to be, as careful as regular scientists. There are so many reasons why mistakes can get made using data. Just for example, if you average scientific studies' results, you're going to average the results of positive studies and not include the studies that were never published because they were negative. Everyone knows this, but it always happens. I've looked at 20 studies, I publish the average, that's got to be a good result? No, because the studies that were not published probably didn't work, and you should include those results. The national government in the US has tried to fix that by having a repository of all medical studies, but I don't think it's yet having a big enough impact. And then of course journalists, I'm afraid, don't understand enough about causality either, and they're drawn to solutions to things that are a little too easy. Do we know the objectives in what we're doing? A very hard problem as well, to get clear objectives in systems; competing objectives are prevalent. Consider the Chinese social credit system. You could argue this is a great thing because it's going to maintain societal coherence; it will nudge the public into being coherent with the greater societal goals of China. You might like that, or not. But if you don't like that, do you believe we should nudge the public to not drink, to not smoke, or to not buy bottled water? More of you probably believe that. So how do you define what's right and what isn't? In fact, we can quantify trade-offs with data science; who's in a position to actually resolve them? Should a school district optimize for the number of students that can get into Georgia Tech, or optimize for the number of students that can achieve a minimal score on a high-school math exam? We can probably quantify that, or we have a chance to; and is a school district going to be able to make that decision, which our systems can then go implement? We also can't usually guarantee results, even when we have resilient and good systems. So can we actually use data science in certain areas, like certain health applications? Will we get to Level 5 self-driving with current approaches? We'll see; no one really knows for sure. Now, I'm going to switch to the ethics issues because of a lack of time, but there are economic and legal issues as well that are complicated. We recommend reading the Belmont principles. The principles you should consider are respect for persons, which often equates to informed consent in medical trials; beneficence, trying to do good overall for people; and justice, trying to balance the benefits across multiple segments of society in a fair way. It makes sense to think about those as you build a data science application or do analyses with it. Now, these are not the only objectives that you'll have, right? If we're realistic, and the Belmont principles' authors understood this: if you're a researcher, you're trying to get tenure, right?
You're trying to do research in an area where you will be impactful, not necessarily in the area that might save the most lives on the planet, right? So there are lots of competing incentives that we all have. But it makes sense to look at the Belmont principles and think about how we balance incentives, how we govern our institutions so that we balance these incentives in a reasonable way, and how we actually build checkpoints into our processes where we consider the principles, just as, say, IRBs do in medical trials. Now, I'm not suggesting that the New York Times convene an IRB every time they publish a headline that might cause societal discontent. But they probably should think about what the headline's effect will be, and I suspect that they do, and that they think about processes for evaluating it, as should the rest of us and everyone else who uses data and draws conclusions. So, as we turn to the last part: society. There are lots of things here; data science is important now. Here are some questions. Should the best machine learning system win an election? Is that how we want elections to be decided? Don't underestimate the importance of this in modern politics: which candidate can best target voters in a variety of ways. Are video games fun, or addictive? Read the latest press, you all know this, on video games in China and such things; how do you deal with this issue? Are personalized recommendations helpful or manipulative? Can we optimize government policy, and is that good for society, or is it what a libertarian might call collectivist tyranny? As we think about the economics of these: are we generating growth and wealth, or are we generating excess concentration of wealth? And I'll give you one last example. If we build a word processor that automatically locates reference material for a point that we make, and this is a great NLP problem, are we creating better documents, or are we just building automated peach picking? I changed cherry to peach on your behalf here, being in Atlanta. Are we just building an automated cherry-picking system that will find arguments to prove your point, whether they should be used or not? It's a good question. So the stakes have changed. I won't have time to go through it all, but just look at the column on the left here, if you can see it. There are economic unfairness issues on people's minds, and they relate to people and to institutions: are institutions getting too big, or are people not doing well because they can't leverage these technologies effectively? People are worried about personal data, in the variety of ways I mentioned earlier. They're interested in what's happening societally: are we leading to discord? Are recommendation systems generating filter bubbles or things of that form, or are they actually doing better than the old rags? Remember, there's a strong argument that Pulitzer and Hearst brought about the annexation of Puerto Rico through their hyperbole at the turn of the century, arguing that Spain had attacked the United States. So maybe this is worse today, maybe it's not worse, but they're interesting questions, and people are concerned. Some people are beginning to be concerned about environmental issues because of the power consumption of some of these machine learning models. And there's a lack of trust, broadly, about what's happening, because people don't understand it.
There are many things you could think about doing about this. We make 14 recommendations. This is quite tricky for, basically, a computer scientist, because I'm not a public policy person, but here are some of them. First, and I'll be concrete: we need to broaden data science education, which means we have to do it in high school or before. So I'm on record now publicly, I think for about the first time, saying I would substitute a lot of statistics and some programming for trig and a key part of calculus, if not all of calculus, in high school. This does not mean that someone who wants to go into electrical engineering doesn't need to take calculus; they'd have to do it. But I think most people would benefit from that, and I think there's enough of an intellectual body now. This is hard to achieve, but we talk about how to do it in the book. Second, we have no vocabulary. I don't know if you've noticed this, but if you get into an argument with someone on privacy, I rather doubt you're talking about the same thing as the other person. You might be talking about confidentiality; they may be talking about the fact that something showed them an advertisement they thought was spooky, which might have nothing to do with confidentiality, since only they saw it. So we don't have good vocabulary; we need to create the vocabulary to discuss the issues I've been bringing up. Next, there are two things in technology and data science that are generating large scale. Technology has always had some economies of scale associated with it; look at the size of IBM, AT&T, and General Motors. There have been economies-of-scale effects and big companies for a long time. But with data, there's also the virtuous circle: a system that gets more data probably works better, gets more users, and so gets more data, and that circle continues. Now both of those effects are happening together, and what is the impact of that on the world? It's an interesting question, and it's interesting to think about what to do about it. We don't have a specific recommendation, but we think you need to look at those two effects as they combine. Three more. We honestly submit, and this is what I would do as a product manager at a company, that you should use this rubric for checking the boxes on your considerations, certainly including the ethical side of things as you do it (a hypothetical checklist sketch follows below). Next, we all need to be truthful. If there's any ethical exhortation I would make, it's: be truthful. It's easy to say to a civil engineer, right: don't build a bridge where you're concerned that the finite element analysis isn't right and the bridge might fall down. We shouldn't be writing computer programs, or doing statistics, or generating conclusions, that we don't feel good about. Now, there are other ethical issues that are more complex, but that's the basics: we should be telling the truth under all circumstances, and I'm not sure that's always been done. And finally, we do need to create organizational approaches to considering ethics in decision-making. There's no clear answer to many of these things, and people will reach different conclusions; you can see them in the press all the time. But we need to raise the debate; a decision has to be made, and people will have to move on after that decision gets made.
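As a purely hypothetical illustration of "checking the boxes," here is one way a product manager might track the seven rubric elements in code. The field names paraphrase the talk and are not an official schema from the book.

```python
# A hypothetical checklist for the analysis rubric; empty fields are still open questions.
from dataclasses import dataclass, fields

@dataclass
class RubricCheck:
    data_availability: str = ""        # do we have, or can we get, the data?
    technical_approach: str = ""       # is there a plausible method that might work?
    dependability: str = ""            # privacy, security, abuse resistance, resilience
    understanding: str = ""            # explanation, causality, reproducibility needs
    clear_objectives: str = ""         # what exactly are we optimizing, for whom?
    failure_tolerance: str = ""        # what happens when the system is wrong?
    ethics_legal_societal: str = ""    # Belmont-style review, legal and societal impact

    def unresolved(self):
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

check = RubricCheck(
    data_availability="query logs exist; retention policy reviewed",
    technical_approach="baseline: edit distance plus a language model",
)
print("Still open:", check.unresolved())
```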
What do I hope you take from this talk? There's a coherent field here. It's transdisciplinary, with, most importantly, OR (because of its optimization focus), statistics, and computing. There's a lot of DS plus X, or what some people call X plus DS: a relationship with every other field where data science is applied. There are many great opportunities; I hope you get the impression that I'm positive about data science. But it's challenging, so the good news is that if you're interested in the field and you want to go do good work, there's plenty to be done to improve it. We offer you this analysis rubric as a way of thinking about things. The practice of ethics should be a focus; you can do it, and you don't have to be some conehead philosopher to do it. Society has concerns about us. When I was a young student, no one knew what a program was. You could never find an article in the newspaper about computers. Can you imagine? I remember when the New York Times would occasionally have an article about a computer in it, and that would be like once every three months; that would be the mention of a computer when I was in college. And now it's every day, because of these concerns, the impact on society, et cetera. Simplistic solutions don't work. You could think: oh, we'll just regulate content moderation. What does that mean, right? There are no easy solutions. And if you regulate things too much, you end up partitioning the Internet between a nation-state and the rest of the world: if you say that no US institution can do X, it will be done in the Cayman Islands, and then are you going to actually put a firewall around the country? So they're fascinating questions; I urge you to consider them; there are no easy solutions; and I'm optimistic ultimately. So thank you for your attention. You've all stayed, and I will show you where the slides are.

Yeah, and I should also mention that this is truly a hybrid event. Can you see the chat? Okay, got it. If anyone out in video land wasn't comatose from the talk and has a question, please put it in the chat and we will get to it, we promise. Or, V, you had a question? You need a microphone, right?

I have two questions. One: in this day and age, everybody and their sister seems to be doing data science, and maybe launching data science programs. Is it real, or mostly hype? Can you tell us a little bit about that?

So the question was, quote, everyone and their sister (I think he meant brothers as well, and I don't think he was meaning to be exclusionary) is doing data science; is there that much to do, and are they doing real things? What I would say is: broadly, there is a coherent field of study here. I think you can create a program around data science where people probably specialize in some aspect of it or another. You want to be able to not just understand the broad perspective, which I've described, but actually do good statistics, or very good machine learning, or very good operations research, or requirements analysis, or whatever it might be in your program. So I think these programs are reasonable. My own personal opinion is that, within the generality of the overall field which I've presented, I would specialize in some sub-areas, and then there's lots to be done. Okay, that was the first question.
Is there a question from someone else, and then we'll get back to you for the second one? I'm just trying to practice fairness in my objective functions. V, okay, you can ask yours. Can you repeat your question? Ah, he can't talk without a microphone. Okay. Thank you. Oh, my God. Alright.

So the question was from someone who has been in the field for a while, asking about a project I did when I was a professor at Carnegie Mellon in the 1980s: a system we called Camelot, the Carnegie Mellon Low Overhead Transaction system, which is a fairly good name, we think; whether the system was any good, you can judge. We were trying to explore the role of atomicity in distributed computation: to what extent was it important to be able to say that a group of actions would either all occur or never occur, and that if they all occurred, they would be forever having occurred; they would not be lost, they would be durable. As some of you know, these became known as the ACID properties, and we do this within a database today quite typically, if you're doing transactions like withdrawing money from a bank or booking a reservation. But the thought was that as systems became bigger and bigger, there would be many different forms of databases, many different abstractions, and much more complicated algorithmic processing, and you would like that kind of guarantee about consistency to be stronger and to extend to everything. So a couple of us, Barbara Liskov, who's a Turing Award winner, myself, and other people I could name, were very interested in this and pursued it. We didn't know the term data science, and we weren't really understanding of how that would play out. We thought of the world as these consistent databases of many different forms: queues and directories and relations and hierarchies, all of which would have to be maintained. As it turned out, I'm afraid the idea never really got traction; I would say it got a little bit of traction, but nothing like we thought it would. It turned out that being able to maintain invariants across large amounts of data was not as important, or if it was, the world didn't turn out that way; we could spend hours on it. So it's an interesting question. I guess I would reframe it as: would I have put so much time into it if I had thought things would turn out differently, and that it might end up a footnote? And I think the answer is partially yes, we would have done something different based on that.

So, Carlton is somewhat expert in this. Alfred is a role model of mine; we both spent some time on transactions, and I understand that he's very modest about his contributions there. But the way I describe the contribution of transactions to my students is that it's still a core contribution today, because it preserves something called the conservation of money. Business will not work today without conservation of money; every business, every enterprise needs conservation of money in order to work, and transactions are the thing that preserves the conservation of money, and therefore it's a core contribution. And Alfred is very modest about his role in this. Thank you.
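Here is a minimal sketch of the atomicity guarantee just discussed, using SQLite's ordinary transactions rather than Camelot itself: a transfer either debits one account and credits the other, or does neither, so the total, Carlton's "conservation of money," is preserved even when a transfer aborts partway through. The account names and amounts are invented.

```python
# Atomic money transfer: both updates commit together, or neither survives.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(frm, to, amount):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, frm))
            new_balance = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (frm,)).fetchone()[0]
            if new_balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, to))
    except ValueError as err:
        print("transfer aborted:", err)

transfer("alice", "bob", 30)    # succeeds: both updates apply
transfer("alice", "bob", 500)   # aborted: neither update survives
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
print("total:", conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0])  # still 150
```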
Let's see, okay. And then we'll go to the next question; you've been infinitely patient.

So, Alfred, my question relates to the plan-for-failure element. Obviously there's the user's point of view, but my question is more about the science. When you apply data science, when you learn something from data, can we say that a requirement is that you actually understand the limitations of the model? Because if you apply the model to data that does not have the same distribution, should it just say: hey, look, I don't think I'm going to work, don't trust my prediction?

So what are you referring to? Some questions about planning for failure are: when I use the result from a data science analysis, should I have some mechanism, say majority voting or whatever, that signals the result may not be correct and catches the error?

This is from the science point of view: when you build a model, you know its limitations, because of the data you're using. When you apply the model in the future, the data may have a different distribution, and the model itself should say: hey, look, this data doesn't fit my training data, so don't trust my result.

So, one possibility is basically that you are prepared for the model to be wrong on things, and the model itself says, hey, look, I could be wrong. So the issue is: how do you take into account the likelihood or possibility of failure? There, like almost anything, within the abstraction we're considering, we could try to prevent it. We could do all sorts of things: build ensemble mechanisms, build in error checking and bounds checking. We could say, I have little trust: machine learning systems do have abilities to calibrate their confidence in results sometimes; that's confidence intervals, right, in statistics. So we can do that. We can also do it at the level of whoever uses that abstraction, which I think is what you're saying: we could say, I don't even believe that, and I want another layer of checking as well. What actually comes out of this, I think, from a software engineering perspective, is that traditionally, starting with fail-stop computers, we believed algorithms either worked or didn't work. As a security person, you know, we used to just say: here's something, I'll include it in my program, and now my program will be better and it will work. Now, with software, from a security point of view, we're inserting a risk, as you were telling me this morning; and with data science, we may also be inserting a potentially invalid result. So what you have to think about is the implication of that in the program. If it's something that's playing chess with me, so what if it loses occasionally, once every hundred years, because of a mistake? That's not a big deal. If I'm driving a car with it, and it's doing the game theory of who's going to go through a stoplight first, and we both go through at the same time, that's a much greater risk. So I think you need to do the risk analysis whenever things work only probabilistically. A tiny sketch of the "don't trust my prediction" idea follows.
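Here is a toy sketch of that "don't trust my prediction" idea: keep a crude summary of the training distribution alongside the model and abstain when an input looks nothing like it. The threshold, the z-score heuristic, and the stand-in model are illustrative assumptions, not a statement of how any production system handles distribution shift.

```python
# Abstain when an input is far from the training distribution.
import numpy as np

rng = np.random.default_rng(2)
X_train = rng.normal(0.0, 1.0, (1000, 5))          # training inputs
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)

def z_score(x):
    # Largest per-feature distance from the training data, in standard deviations.
    return np.abs((x - mu) / sigma).max()

def predict_or_abstain(x, model=lambda x: int(x.sum() > 0), z_limit=4.0):
    if z_score(x) > z_limit:
        return "abstain: input far outside the training distribution"
    return model(x)  # stand-in classifier used only for illustration

print(predict_or_abstain(np.array([0.3, -1.2, 0.5, 0.1, -0.4])))   # in-distribution -> a prediction
print(predict_or_abstain(np.array([9.0, -7.5, 12.0, 0.2, -0.1])))  # shifted -> abstain
```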
Okay, I'm afraid we have to wrap up. Alright, so, yup, sorry; you'll kill me later, but okay. So look, again, I want to thank you all for sticking with us.

Thanks very much for having me. I'll be very interested in any feedback you can give me. Thanks.