Thank you for staying. It's a hard act to follow Charlie; that was a wonderful talk. So, I'm not a roboticist. I am now getting into robotics, and I actually do have papers in robotics conferences, but nearly all of you have more robotics experience than I do. So, a couple of disclaimers. You can dismiss me out of hand if you'd like to. I am being deliberately somewhat provocative in this talk, partially to set context for my work, and partially because now I have a reputation to uphold. I did this a couple of years ago; I had just gotten tenure and I thought, why not? I tried it once, and now it seems to have stuck. And in some sense, isn't this the fun part? This is a safe sort of audience, with friends; isn't this what being back is all about? In a way, it's better to be wrong than not even wrong. At least I can put a stake in the ground and be wrong a few years from now, rather than play it safe. I don't want to give another CVPR talk; that's boring. So let's try something. First of all, I say robotics is too important to be left to roboticists. What I actually mean is not quite robotics; what I mean is embodied AI. I'm interested in AI, in intelligent systems that are embodied in a physical body, but I'm also open to virtual bodies. AI is the science and engineering of intelligent machines. Embodied AI is the science and engineering of intelligent machines with a virtual or physical embodiment. I'm thinking not only of robots, but also smart glasses, intelligent assistants, where the problems you have to solve are egocentric sensing and perception, some sort of persistent memory and awareness, instruction following, and acting in environments. And I'm drawing a distinction from disembodied AI, from static machine learning tasks on datasets, if you will. That's the distinction I'm drawing. I'd like to structure my talk in three steps.
First, I'd like to put down the postulates, or axioms. These are statements that I'm going to assume are true for the purposes of this talk. The reason I say that is that in a robotics audience I actually do get challenged on these statements. So what I'm saying is: hold off on those challenges until the end of the talk; let's just assume these are true. Then come the hypotheses, where I welcome all challenges; hypotheses are meant to be falsified. Axioms are not; axioms are axiomatically true. And then, hopefully, if we make it past that and you're still with me, here's what I think the implications are. So what are these self-evident truths we must begin with? Here is my self-evident view of how we can get to intelligent systems. Axiom number one: benchmarking is good. This is something on which I actually do get resistance from the robotics community. The answer, in my view, to bad benchmarks is better benchmarks; the answer to small benchmarks is larger benchmarks; the answer to static benchmarks is dynamic benchmarks; the answer to single-task benchmarks is multi-task benchmarks. The answer is always more benchmarks. That's an axiom; it is axiomatically true. Computer vision in particular has a long history of datasets, benchmarks, and common task frameworks. The machine learning community now has an actual track: NeurIPS is the main venue, and at NeurIPS 2021 there is a Datasets and Benchmarks track, where if you're proposing a new task or a new benchmark, you're especially welcomed and encouraged. You can go and scroll that page, where people are creating all sorts of interesting new things. So that's exhibit number one. We agree benchmarking is good, and I apologize if you're internally having a struggle about this, with all the problems benchmarks have, but let's hold off. On to axiom number two.
Reproducibility is good. If a result cannot in principle be independently reproduced, it may as well not exist. The robotics community in general does have this problem, where you hear claims like: "My advisor's roommate's cousin showed me a video of a robot juggling two balls in 1980, so this problem is considered solved." Statements like these have no place in science; they should be permanently banished. We're going to axiomatically agree that reproducibility is good. I agree that there are logistical, financial, and other considerations, but if a result cannot in principle be reproduced, it may as well not exist. And I would state that there is a reproducibility crisis in robotics. As a field, it does not have the same sort of standard benchmarks, tasks, and common task frameworks. If you don't trust me, trust your own colleagues. Here is a paper on arXiv, with an associated talk at RSS last year, on what robotics research can learn from computer vision research, written by robotics researchers who are actively exploring this. The main point they were trying to make is: reproducible benchmarks plus competition equals progress. And this is something we've seen consistently in other parts of AI. My lab, together with Devi Parikh's lab, runs a visual question answering challenge every year. On the x-axis is time, on the y-axis is performance. This is a standard dataset; the details of the task and the metrics aren't particularly important. What's important is that there's an annual competition that we've been running for about five years now, and like clockwork, as time goes right, performance goes up. This is just a law of nature that we can establish: if you have a dataset and you give people a common task and a competition, progress happens. And this is what's happening now in the embodied AI community as well.
So there's an annual workshop on embodied AI that we've been organizing for three years at CVPR; you might as well call it "robotics at CVPR," if you will. There we host annual competitions on navigation problems, because navigation first made headway into the vision community, in a couple of different variants of the problem. The details, again, aren't particularly important. The stats from the last three years we've been doing this are clear: the number of teams participating in our competitions is going up, and the corresponding performance metrics on those tasks have also been going up. I'm ready to declare this just a law of nature. My colleagues at Stanford have even taken this to a sim-to-real trial. Imagine people sitting at home who do not own a robot: they submit their code, and it gets evaluated both in simulation and on a real robot in Airbnb-rented houses. So that's the state of affairs at vision conferences; these are the kinds of competitions that are happening. I also want to mention a recent effort we're involved in, where we put out a tech report on mobile manipulation, on object rearrangement as a standardized benchmark. These could be static robots operating entirely from vision, but there are also mobile aspects, where you do pick-and-place or other such tasks. Here are the people involved; some of them are roboticists, most of them are not. We put together a report and sent it out to a wide group of researchers, primarily roboticists but not just robotics folks, and we received a good set of replies. Frank was on that list. Here are the comments people sent us, color-coded positive or negative. Interestingly, we were pretty harsh in the coding: unless people explicitly said "you're doing the right thing," we counted them as neutral. So the 14 positive comments are significantly positive.
Some of the positive things we received were nice; I won't go over those. Yes, you're doing a good job. The sharpest criticisms, I think, are illustrative, because they tell us about the pulse of the robotics community. The sharpest criticisms we received were: "You folks are not real roboticists; you should add some to the team." "Benchmarks are a fool's errand; they lead to highly engineered solutions without contributing to deep understanding." "It is also fair to say that many rather elaborate and thoughtful attempts to define benchmarks have had little impact." This community doesn't seem to agree on its own efforts: I thought YCB was a success, and I'm not even part of your community; you don't think of YCB as a success? "What you call embodied AI is in fact robotics; you should call things by their proper name." And: "Simulation is a waste of time." These, I think, were irreconcilable differences; we decided not to address them in our report and moved on. My internal response, which I did not provide to people, was to point at the list of top venues. These are the top 100 venues according to citation counts and impact factors. Computer vision sits at numbers 4, 27, and 31. Machine learning is around 10, 12, and 23. Robotics does not appear in the top 100. And that's despite "vision is easy." We are sibling fields from the same parent field; I think it's time to admit that something might be going right in some of those fields and wrong in others. Okay. So those are the postulates: benchmarking is good, reproducibility is good. Here are the hypotheses. This is where I'm perfectly open to admitting I can be wrong, and I'd love to find out where we go wrong. If we want to get to intelligent systems that operate in the real world, here are the hypotheses I think we can pursue. Number one: I think intelligence is an emergent phenomenon. What is emergence?
I really like this definition from David Deutsch, the physicist: it's the idea of low-level complexity giving rise to simplicity at a higher level of abstraction. This happens in science across the board: interactions between particles in physics give rise to molecules and chemistry, the study of chemical substances gives rise to biology, which gives rise to psychology; ants create colonies, people create organizations, brains create minds. So that's the idea that intelligence is an emergent phenomenon, and this is, I think, the least contentious of the hypotheses. The one that perhaps we in this field might actually agree with is the embodiment hypothesis: the idea that intelligence emerges primarily from the interactions of an agent with an environment, as a result of sensorimotor activity. You create sophisticated environments and you give intelligent agents the ability to maximize reward, and what you will see is the emergence of intelligent phenomena. This notion of an agent and an environment, from RL, is actually a really powerful idea; techniques aside, the abstraction itself is powerful. And my claim would be: if you choose a poor environment, you will get narrow kinds of intelligent systems. If we choose Go, then all we get are intelligent systems that can play Go. If we choose the text written online by humans, what we will get are GPT-3-like systems that can autocomplete text, but not much else. If we choose all the source code available on GitHub, what we will get are Copilot- and Codex-type autocomplete systems that can go from natural language to programming. The most interesting part of robotics is not the algorithms, not the planning; it's the fact that your environment is the most interesting one. You're dealing with the laws of physics, you're dealing with 3D, you're dealing with perception.
You're dealing with other agents that are present in the environment. The world that you're dealing with is the most interesting bit, and I think we should remember that. That's what makes the emergence of those intelligent systems perhaps the most powerful. Hypothesis number three is what is generally known as the scaling hypothesis. Once we find a suitable general-purpose architecture, whether it's self-attention, whether it's convolution, some general-purpose system, we can train ever larger models, and ever more sophisticated behavior will emerge, as long as your environment is challenging enough, rich enough. In a nutshell: scaling is all you need. Bigger brains will give you more intelligent agents. If you want to see an actual example of this put into practice in language modeling, I highly encourage you to check out the paper called "Scaling Laws for Neural Language Models" from OpenAI. It's just a fascinating read, where they've taken an empirical-science view of things and come up with laws for how test loss goes down as compute varies. It's a beautiful study; do take a look at it. Okay, so stepping back: the emergence hypothesis, the embodiment hypothesis, the scaling hypothesis. That's what I'm presenting as things where it's certainly possible I'm wrong, but this is where I think the hypotheses stand. So where does that leave us? What should we do? My proposal, which I think comes as a corollary of accepting what has come before, is that we have to build extremely large-scale, extremely fast simulation. We have to be in a position where we can simulate 20 years of experience, plus learning, of an agent interacting with a rich and sophisticated environment, in under 20 minutes of wall-clock time. If you do that, then you can scale your learning systems, and you can set up an outer loop of architecture search, model selection, whatever, where you're essentially rerunning evolution.
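To make the scale of that proposal concrete, here's a quick back-of-envelope check (my own arithmetic, not from the talk) of how much faster than real time such a simulator has to run:

```python
# Back-of-envelope: "20 years of experience in 20 minutes of wall clock"
# implies the simulator must run about half a million times real time.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # ignoring leap days

experience_s = 20 * SECONDS_PER_YEAR    # 20 years of agent experience
wall_clock_s = 20 * 60                  # 20 minutes of wall-clock time

speedup = experience_s / wall_clock_s
print(f"required speedup: {speedup:,.0f}x real time")
# → required speedup: 525,600x real time
```

That half-million-x figure is why the outer loop of architecture search only becomes feasible once simulation is this fast.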
And I think that's when we'll see the emergence of intelligent systems. This also has to be reproducible: we have to create benchmarks, we have to create common tasks, we have to provide 3D environments, virtual robots, and tasks, and share them so that people can easily reproduce these things and pursue sim-to-real transfer. I see the role of a lot of the work the robotics community does as grounding the environment building, because if you build the wrong environment, you will end up with intelligent systems that are perhaps not all that interesting for the physical world. And I'm a believer in the 80/20 principle: large parts of the gains come from approximations that are good enough; a lot of models are useful. Perhaps just rigid-body simulation is fine. You don't need deformables, you don't need fluids, you don't need gases; a whole bunch of things don't need to be simulated. So for the last four years, my group, in collaboration with Facebook AI Research, has been working on a large-scale simulation project called Habitat. Initially we started out studying navigation problems; more recently, we've been doing mobile manipulation problems, where we study tasks like a Fetch robot in an indoor environment with objects and articulated cabinets and such, solving a rearrangement problem. Think of it as a tidying-the-house problem, where you're picking objects and moving them from one place to another; a cleaning-the-house setup. You can think of a putting-away-groceries setup, where you pick up objects from a refrigerator, which is an articulated object, and put them somewhere else. You can think of a set-the-table setup, where a drawer has to be opened, an object taken out and put on a table, the drawer closed, and so on. So there's rigid-body dynamics; there are no deformable objects, there are no liquids.
There are no thin films, there is no cloth, there are no ropes, there are no physical transformations. It's just rigid-body physics plus photorealistic rendering in large-scale environments. And continuing with the theme I've presented, I think fast simulation is absolutely essential. Compared to other simulation efforts on the market, which typically run at around 30 to 60 simulation steps per second, Habitat on a single GPU runs at around 8,000 steps per second, and on a GPU node it runs at over 25,000 steps per second, which is about 850x real time. So you can simulate 1 second of real-world experience in about 1/850th of a second, where I'm counting 30 steps per second as real time. This is what lets us scale some of the problems we're studying to a degree that we haven't been able to before. In the navigation literature, we have seen the effect of this scaling: you can actually get near-perfect performance in new environments. We have not yet done those sorts of things in the mobile manipulation literature, and we have not yet done sim-to-real studies in the mobile manipulation literature. Okay, I think I will stop here, and I'm happy to take questions, but I'll leave you with some food for thought. These two axioms should, I think, be universal consensus; we should have universal consensus on them, but we don't. These hypotheses, I don't know; I'm happy to be wrong, but they seem right to me. We can design experiments, we can come up with ways of proving them wrong. And this is what I'm pursuing: fast simulation. I think it's an exciting field. The net effect I'd like to see is what happened in other fields where there was a low barrier to entry. You want people to be able to enter a field without spending $100,000, or even $2,000: people just download a piece of software from the web, make small edits, submit a piece of code, and it actually makes a meaningful improvement to robotics research.
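As a quick sanity check on the throughput numbers just quoted (my own arithmetic; the ~850x figure is the talk's):

```python
# Check the Habitat throughput figures against the stated ~850x real time,
# taking 30 steps per second as "real time", as in the talk.
REAL_TIME_SPS = 30          # steps per second treated as real time
SINGLE_GPU_SPS = 8_000      # Habitat, single GPU
GPU_NODE_SPS = 25_000       # Habitat, one GPU node

single_gpu_speedup = SINGLE_GPU_SPS / REAL_TIME_SPS
node_speedup = GPU_NODE_SPS / REAL_TIME_SPS
print(f"single GPU: {single_gpu_speedup:.0f}x real time")
print(f"GPU node:   {node_speedup:.0f}x real time")  # ≈ 833x, roughly the ~850x quoted
```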
And so that would be nice. That's what I'm pursuing. Thank you.

[Audience question, largely inaudible: roughly, is emergence the only way, or will you see the light in five or ten years and hand-design solutions?]

My thought is this: if you're going to build a system to solve a problem, you're going to find a number of solutions, and some of them you'll want to hand-design. We don't solve sorting with machine learning; this is what I say in my deep learning class. We don't build a learning algorithm and teach a learning system "here's what a right sorting of numbers looks like and here's what a wrong sorting of numbers looks like." So yes, of course, well-designed, well-understood algorithms are the right solution for certain problems, once you understand the problem sufficiently. But if you look at how we keep defining intelligence, we keep defining it in ways that prevent us from coming up with those solutions. The history of AI has been: we come up with chess, we come up with Go, people create systems that solve those problems, and then we redefine them out of AI. If we keep defining intelligence as the ability to do new things, then it's unclear how you can come up with well-specified engineering solutions. So there are no new ideas here; yes, I might see the light in a few years. Maybe.

[Audience question, largely inaudible, in two parts; the first concerns other agents in the environment.]

Can I take the first one first? Okay. I think that's absolutely right. I think of other agents as part of creating the rich environment that is necessary for the agent to learn. But we've seen the success of self-play; we don't need an oracle agent or something significantly smarter than you.
If you look at, for example, the AlphaGo work, or other works that have done this even in 3D simulation: as you get better, the other agents in the environment also get better, and that seems to work.

[Audience comment, largely inaudible, about world models and the gap between simulation and the real world.]

Thanks. I think those are good points, and I agree with nearly everything you said. You're right, we do try to do sim-to-real studies; I didn't talk about them in this particular talk. And that's exactly for this reason: you want the findings in simulation to be roughly aligned with what you find when you run real-world studies. I don't think you need to make as big a bet as zero-shot sim-to-real transfer; your simulation is never going to be perfect, but you can pre-train in simulation. I think of it almost the same way as in deep learning and machine learning, where we pre-train on one dataset and fine-tune on another: you're going to pre-train in one environment and fine-tune in another. But the important thing is that there be some grounding, so that you're not creating the MNIST of 3D simulation; that would be a failure if that's where we ended up.

[Audience question, largely inaudible, about the assumptions behind what gets simulated.]

So the assumption we're making, the way I think of it, is: do you want to maximize speed subject to some realism constraints, or do you want to maximize realism subject to certain speed constraints? I'm in the former camp. I'm okay with starting simple: the navigation work we did had no physics at all.
It had video-game physics: there's a virtual cylinder, you don't actually bump into anything, you've pre-charted where you can go, and you just do that. Now we actually have rigid-body dynamics, done with the Bullet physics engine; we're not creating our own physics engine. The reason we're operating in this order is that we think of speed as the primary goal, a first-class citizen: we want to build systems we can learn from, and then keep expanding this frontier. You're absolutely right that there's no way we're training robots that will be able to pick up bottles that partially compress, or pour fluids; that's just not going to happen in the kinds of simulations we're building. But I think there's a lot of physical understanding, egocentric perception, and task and motion planning that can be studied first in this domain, before you get to the deformables and the ropes and the fluids and those sorts of things.

[Audience question, largely inaudible, about how the reward is specified.]

So what we're actually doing in practice is setting up tasks in simulation, navigation tasks and mobile manipulation tasks, where as experiment designers we know exactly what the right answer is. We can give partial-progress reward, we can give a penalty for collisions, we can set up all sorts of things. But I agree, there's a deeper question you're trying to get at: what is the right reward? There is a problem in this field, which is reward engineering: the behaviors that emerge depend very critically and sensitively on what kind of rewards you write down. I don't have the right answer for that; I think a mix of imitation learning and things like that will help us get there. It's a rich field, and I'm not the only one talking about this. If you want to get philosophical, biological systems have only one reward: survival. If you survive, you replicate.
And whatever behavior helps you replicate, that's the behavior that gets propagated onward.

[Audience question, largely inaudible, about what the current bottleneck is.]

Right now we're held back by forward prop and backprop on the networks we can train. Habitat is already doing 25,000 steps per second on a GPU node; I can't actually train the ResNet-plus architectures I'm writing down at 25,000 steps per second, but we're working on it. So right now the bottleneck is learning. This is a constant knocking down of the pegs that are holding you back: once we get learning faster, which the supervised learning community is also working on, the bottleneck will be simulation again. We're not held back by rendering right now; we've written our own renderer from scratch, and it's fast enough. Rigid-body physics is what's holding us back and will be a problem in the future, so we'll have to work on that as well.

[Audience question, largely inaudible, about whether this will transfer to real robots.]

I hope so. I don't imagine that we'll do zero-shot sim-to-real transfer. But I imagine that, just like we do dataset-to-dataset transfer, where you pre-train on ImageNet, an equivalent of that will come up in the embodied setting: you do large-scale training in simulation and fine-tune certain parts. I'm also perfectly okay if there are plug-and-play modules in your agent, where maybe the CNN is swapped out but the rest of the RNN survives, and so on. I think there's a lot of open research to be done. I've seen enough evidence in the navigation literature to suggest that zero-shot sim-to-real can work in certain cases, and if not, just a little bit of domain adaptation with GANs goes a long way. We have not done any of that in the mobile manipulation literature, so I can't comment on that. But I imagine the problems there will be a physical-realism gap, not a visual-realism gap.
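As a footnote to the exchange about rewards above: the kind of shaped reward mentioned, dense partial-progress reward plus a collision penalty, might look roughly like this. This is purely an illustrative sketch; the function name and coefficients are hypothetical, not Habitat's actual reward.

```python
# Illustrative shaped reward for one step of a navigation/rearrangement task:
# dense progress toward the goal, a collision penalty, and a success bonus.
# All names and coefficients here are hypothetical, for illustration only.
def shaped_reward(prev_dist, curr_dist, collided, reached_goal,
                  progress_scale=1.0, collision_penalty=0.1, success_bonus=10.0):
    """prev_dist / curr_dist: distance to the goal before / after the step."""
    reward = progress_scale * (prev_dist - curr_dist)  # reward for getting closer
    if collided:
        reward -= collision_penalty                    # discourage bumping into things
    if reached_goal:
        reward += success_bonus                        # sparse bonus on task success
    return reward

# Moving 0.5 m closer with no collision:
print(shaped_reward(3.0, 2.5, collided=False, reached_goal=False))  # → 0.5
```

The talk's point about reward engineering is visible even in this toy: the emergent behavior depends sensitively on the relative magnitudes of the progress scale, collision penalty, and success bonus.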