Thank you, everyone. Today I'm going to talk about the idea of pretraining. Most of the work in my group over the last three to five years has focused on reasoning and control for embodied systems. The approach I take is building structured representations and priors: I argue that this kind of reasoning requires structured representations, and that we need to figure out what data is needed to discover them. And, being the people in robotics, where we want to deploy these systems, we also have to think about transfer and safety. But today, let's talk about one thing in particular, which is scaling.

Scaling seems to be the elephant in every room in robotics these days. A simple way to put it is that large-scale datasets have led to accelerated progress in the sister fields of NLP and vision. This has effectively led to the idea of foundation models, particularly what I would call LLM agents: systems that can make decisions in a sequential manner. These have, of course, already been explored in the context of games. At the same time, one could argue that large-scale data and simulation have helped robotics a lot with problems such as dexterity, locomotion, and some problems in manipulation. We have seen that large-scale simulation enabled skills that we deemed very challenging, if not impossible, and that are now, surprisingly for many of us, completely solved in simulation and deployed on real systems.

So we have seen this idea of the unreasonable effectiveness of data, especially if the data is of the right kind. In language and vision, the data is of the right kind just by getting it from the internet. For robotics, that is actually not true. All of this calls for priors, but the questions are: what data would work, and how are we going to get it? I argue that we should approach this problem from the perspective of both algorithms and systems. Robotics has been arguing "science and systems" for a while, but I think we really need to think about this more formally. I would argue that algorithmic frameworks are what will give us foundation models for priors. We still need data engines, whether the data is real or simulated, and we can debate all day long about which kind of data would work. And we need scalable data-collection frameworks.

Because we only have 15 minutes, today I'll focus on some of our recent work on frameworks for building priors in robotics, particularly in manipulation. I would argue we need perception priors, things like what objects are and what the concepts should be; priors for skills; and then behaviors that we can compose through reasoning.

Let's talk about perception. One of the problems we have been looking at is what we call functional actions, or functional affordances. A way to think about this: if you're pouring from, say, a jug into a cup, what is the concept of that process? Does it matter if the jug has a particular shape? Does it matter if you pour with the left hand or the right hand? Does it matter what cup you're pouring into? That is the problem we're studying.
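To make the setup concrete, here is a minimal sketch, with made-up names, of the kind of interface such a functional-affordance model might expose: per-point, action-specific scores over an object point cloud. The geometric heuristic below is only a placeholder for what is actually learned; it is not the real model.

```python
# A minimal, hypothetical sketch of a functional-affordance interface: given a
# full 3-D object point cloud, return per-point scores for an action-specific
# part, e.g. the "pour-from" region of a jug. The heuristic is a stand-in.
import numpy as np

def pour_from_affordance(points: np.ndarray) -> np.ndarray:
    """points: (N, 3) object point cloud -> (N,) scores in [0, 1]."""
    # Placeholder heuristic: score points near the top of the object higher.
    # The learned model recovers such action-specific parts without part labels.
    heights = points[:, 2]
    span = heights.max() - heights.min() + 1e-8
    return (heights - heights.min()) / span

cloud = np.random.rand(1024, 3)       # random points standing in for a jug
scores = pour_from_affordance(cloud)  # high scores ~ candidate pour-from part
```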
We argued that there should, in principle, be a mechanism that captures the concept of pouring (pouring being just an example) in a way that lets you generalize beyond any particular pair of objects and any particular embodiment; in a sense, it shouldn't matter whether you use a dexterous hand or a two-finger gripper. For pouring, you can really think of the action as aligning certain parts of the objects: the pouring object has to be aligned to the receptacle, the receptacle has to be facing up (otherwise it wouldn't be pouring), and the part you're pouring from has to be tilted so that something can actually be poured out. The thing to note, however, is that we are not given these part segmentations ahead of time, because they are action-specific: the part you pour from is defined by the action, not by a generic object representation. In this work, we provide full 3D object shapes, but we recover the part-based segmentations, the part-based affordances with respect to particular actions, in an effectively unsupervised way. The result is that we can learn these part-based affordances in simulation and then use them again in the real world; the transfer works primarily because we use point clouds as the input modality.

Moving on to skills. Another problem is the ability to reason about interaction without being given structure ahead of time. If I show you this object and ask what you can do with it, many of you would say that most of the ways to interact with it lie in a small number of dimensions: you can open and close a few drawers, maybe open and close the shelf door. If you think about it, the latent space of this object is really only about 12 dimensions; there are 12 dimensions of interaction. However, if you were to pose this in policy space, or ask a robot to learn a policy for it online, the problem would be very large, rather wasteful for a robot to learn from scratch. And of course we want to do this for an arbitrary set of objects. What we've done is learn an unsupervised model that interacts with a lot of these objects ahead of time and builds a prior over how objects can be interacted with. Notice that this is not built for a particular goal; it is just figuring out, when I look at an object, what the most sensible dimensions of interaction are. You can then take this model and use it for goal reaching. For example, if I now say "open this object," it can figure out which actions, which interactions with the object, would achieve the goal, the goal being open, closed, and so on. In this case we were using image-based goals, but you can of course use text-based goals as well (a rough sketch of this goal-reaching interface appears below). In some prior work we had already shown that once you have these goals from the environment, you can use optimal-control-based models rather than learning, and in settings where you don't need tracking you can open and close arbitrary articulated objects without having any 3D models.

All of this is fine, but these are still skills, one-step interactions. A lot of what we need to do is reasoning, which requires compositional planning, often planning and replanning. This is where we have been very excited about using large language models. The whole idea is that language models can effectively tell robots what to do through program synthesis.
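Here is the rough sketch mentioned above of how a pre-trained interaction prior might be used for goal reaching. Everything in it (the candidate format, the scoring, the placeholder embeddings) is a hypothetical stand-in for illustration, not the actual model.

```python
# A rough, self-contained sketch: the prior narrows a huge action space down to
# a handful of candidate interactions, and a goal embedding selects among them.
import numpy as np

def interaction_prior(point_cloud: np.ndarray, k: int = 12) -> list:
    """Propose k candidate interactions: a contact point plus a push/pull direction.
    This stub is random; a learned prior would condition on the object geometry."""
    idx = np.random.choice(len(point_cloud), size=k, replace=False)
    directions = np.random.randn(k, 3)
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return [{"contact": point_cloud[i], "direction": d} for i, d in zip(idx, directions)]

def goal_score(candidate: dict, goal_embedding: np.ndarray) -> float:
    """Stand-in goal-conditioned scorer, e.g. against an image-goal embedding."""
    return float(candidate["direction"] @ goal_embedding)

cloud = np.random.rand(2048, 3)        # placeholder object point cloud
goal = np.random.randn(3)              # placeholder embedding of an image goal ("open")
candidates = interaction_prior(cloud)  # prior narrows exploration to ~12 dimensions
best = max(candidates, key=lambda c: goal_score(c, goal))
```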
You can use language models to output plans, but planning in English doesn't always work, for a variety of reasons. First of all, LLMs do not have eyes; they are not situated in the scene. They might use unavailable actions and objects; they are not always aware of the state of the world. And if you just output text, a text-to-robot-action mapping may not always exist, and the action space in English may be too big, while the robots are just using simple APIs. So instead of using English as the output, the idea was to use programs as the output. You prompt the model, basically teaching it with very few examples what to do. Then you ask the system to generate a plan, where the plan effectively looks like a program. Because it's a program, you can pass it through an interpreter. The interpreter tells you whether the preconditions are met; if they are, the plan goes to the environment, which in this case is the robot. If they are not, you pass it back to the language model to fix the plan. This worked pretty well; we were certainly able to do simple instruction following. The important thing is that even though these tasks look very simple, we do not assume any prior structure on them, and we are not restricted to particular objects. All of this is nearly zero-shot; there is nothing specific about this environment except, of course, the perception.

Extending this idea, we ran into certain problems: plans generated for robots were not always executable, especially out of the box. If you want to do more complex things, like putting robots in chemistry labs, often the plan would simply not work. Another problem is that, as elsewhere in robotics, we want to use particular DSLs: there is a particular DSL in chemistry, a slightly different DSL in biology when people are running assays, and so on. We want to make sure the model explicitly adheres to that programmatic language. For that, we created what we call Clarify, which adds a verifier that fixes the output plan and ensures it is parsable and executable. This allows us to do slightly more interesting things, like adding baking soda to red cabbage, eighth-grade chemistry, if you will, with real robots, completely in the real world, without providing a lot of demonstrations or task structure. You can, of course, make lemonade by providing very little instruction, and you can do this at chemistry-lab scale.

All of this was fine, but one problem we ran into was that the system was still blind; it had no visual feedback. We have recently tried to fix that. What is the problem? Task-and-motion-planning people know this one. Suppose the task is "open the drawer and put this cube inside." If you observe closely, the drawer is actually not open, but an LLM would not know that. How would it? That is the problem we wanted to fix. This is one of those cases where an LLM will give you an output, and you can fix it by providing simple feedback. We created a system that is basically a combination of an LLM and a VLM, a language model and a vision-language model, with verifiers thrown in, to do these tasks. What can we do with it? Think of this particular task. A vanilla LLM would say: pick up the yellow cube, open the cabinet, and place the cube inside.
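As a minimal sketch of the generate, check, and replan loop described above (the plan format, the precondition check, and the canned model responses are all illustrative stand-ins, not the actual system), that vanilla plan and the kind of feedback it triggers might look like this:

```python
# Sketch of the loop: the LLM proposes a plan as a small program, a verifier
# checks preconditions, and failures are fed back so the model can replan.

def llm_generate_plan(task: str, feedback: str | None = None) -> list[str]:
    """Stand-in for the LLM: return a plan as a small program over the robot API."""
    if feedback is None:
        # Vanilla first attempt for "put the yellow cube in the cabinet".
        return ["pick(yellow_cube)", "open(cabinet)", "place(yellow_cube, cabinet)"]
    # A corrected attempt that frees the gripper before opening.
    return ["open(cabinet)", "pick(yellow_cube)", "place(yellow_cube, cabinet)"]

def check_preconditions(plan: list[str]) -> tuple[bool, str | None]:
    """Stand-in interpreter/verifier: the gripper must be empty before open()."""
    holding = None
    for step in plan:
        if step.startswith("pick"):
            holding = step
        elif step.startswith("open") and holding is not None:
            return False, "gripper is occupied before " + step
        elif step.startswith("place"):
            holding = None
    return True, None

task = "put the yellow cube in the cabinet"
plan = llm_generate_plan(task)              # vanilla plan, generated blind
ok, feedback = check_preconditions(plan)    # fails: pick happens before open
if not ok:                                  # feedback goes back to the model
    plan = llm_generate_plan(task, feedback)
```

In the real system the feedback comes from verifiers and, as described next, from a VLM looking at the actual scene.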
First of all, this plan is wrong: if you have picked up the cube, you cannot open the cabinet. The system should be able to say why it is wrong. In this case, the VLM explicitly says the block is in the way, the system plans again, and the new plan is: move the red block out of the way, open the cabinet, and then put the cube inside. Notice that this time the feedback was not generated by a blind LLM; the VLM provided specific information that enables you to plan correctly. That was one example.

Here is another example where you are effectively doing search: the system is looking for an object that could be in any of the containers. It says open the cabinet, look inside the cabinet, try to find the object; it doesn't find it, so it continues looking in different places; it still doesn't find it, and it goes to the microwave. In this case the microwave cannot be opened directly, so it has to move an object out of the way. In terms of tasks, we found that for simple tasks where motion planning is easy, you see a lot of success with simple methods, like language-to-rewards directly; that came out, I think, in August this year, and a lot of people have already built on it. However, things change when you make the task harder and make the model do search. This is something that got a lot of attention on Twitter last week, if you were following it: the claim that LLMs cannot do search, which is actually, surprisingly, true. LLMs cannot do search, at least the way they are designed; there is no notion of online computation built in. How would they? But you can make them do search, in a way, if the problem is structured correctly. In this case, we showed that if you're looking for something without being told where it is, an object in a bunch of containers, or a fork in the kitchen, or something like that, you can do it using the semantics from the VLM. And we find, first of all, that replanning is necessary for adaptation: replanning helps. Visual feedback matters: we need multimodal models. And checking your work helps; in the LLM world, checking your work is verification, basically asking the LLM to correct the code it has predicted.

To summarize, what I talked about today was priors: perception priors that can be built purely without task supervision, which enables embodiment transfer; priors for skills, such as opening and closing doors on arbitrary articulated objects; and the idea of program synthesis, using LLMs and VLMs, as a powerful prior for reasoning. Of course, I only talked about three aspects of this, and I truly believe that the systems aspects are equally important and required for scalability. I talked about a few papers, primarily the language-guided program synthesis work, which is still in review. With that, thank you; I'm happy to take questions.

Question: I do have a question about scale, because, in case you haven't noticed, the underlying issue of this session is scale. What is the scale that you need?

Answer: Thousands, hundreds of thousands, it depends. We found that it is not the raw scale but the scale of diversity that matters; for example, people have built grasping models with even a hundred examples. But for reasoning, you need diversity of scenes and structure, and a lot of that diversity can be generated procedurally. I do not believe pure real-world collection will be the pathway to that scale, or at least not real-world collection alone, because from what we've seen, the kind of diversity and scale needed may not be achievable that way.
So yes, the answer to that question is that we're looking at billions: billions of time steps, billions of types of problems. While that number may sound exaggeratedly large, if you are able to generate the data procedurally, with variations, it is actually not that much, and we don't have to do it from scratch for every problem. Thanks so much.
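To give a sense of what "procedurally, with variations" might look like in practice, here is a tiny hypothetical sketch; every object name, parameter, and range is made up purely for illustration.

```python
# A tiny sketch of procedurally generated variation: sampling many scene
# configurations from a template instead of collecting each one in the real world.
import random

def sample_scene() -> dict:
    """Sample one scene configuration; all ranges here are illustrative."""
    return {
        "object": random.choice(["mug", "jug", "bottle", "bowl"]),
        "scale": random.uniform(0.7, 1.3),
        "pose_xy": (random.uniform(-0.3, 0.3), random.uniform(-0.3, 0.3)),
        "drawer_count": random.randint(1, 4),
    }

# A few lines of sampling code can produce millions of distinct scenes cheaply.
scenes = [sample_scene() for _ in range(10_000)]
```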