Thanks very much; I'm delighted to be here. Could I have a quick show of hands: who here does robotics? And who does vision? Fewer than I would have thought. Interesting. All right.

The talk is in three parts. The first part is motivation: some opinions about these things, which you can argue with me about. Then some research results from my lab; it's almost entirely the work of my students and my postdocs, and I'm blessed to have wonderful students and postdocs. And lastly, if I get time, some applications: putting robotics and vision together to do useful things in the world. So let me start with why vision is hard for robots.

There is this notion called Moravec's paradox: things that are hard for people, like playing chess, are stupidly easy for computers these days. We beat the world grandmaster back in 1997, just over twenty years ago. Yet picking up a chess piece from a board is still ridiculously hard for a computer or a robot. A small child can do it: ask them to pick up the white queen and they can reach in and pick it up without knocking over the other pieces. That is still a very hard problem for robotics and computer vision. And even if we could engineer a solution for a robot and a vision system on a particular chessboard, we would really struggle to generalize it to other sorts of chessboards, and particularly to a board made of transparent, shiny pieces. We don't talk about transparent, shiny, reflective objects in computer vision; it's a whole thing we have a blind spot towards, which is really sad.

But all of us use vision for tons and tons of things; it's very difficult to imagine what life is like without vision. Robots lack what I would call visual intelligence, and that's something that needs to change. There's an interesting video of a PR1 tidying up a room; I think it was done at Stanford a number of years ago. The robot is teleoperated: it has relatively crude end-effectors and some cameras, and somebody in a back room is looking at the video feed and moving joysticks to control the robot, its hands and its end-effectors, and the robot does a reasonable job of tidying up the room. This tells me that the problem isn't cameras, the problem isn't mobility, and the problem isn't grippers. The problem is the thing that glues them together, that joins the perception to the action. That's the intelligence, and here it's a grad student, probably, in the back room doing that work. That's what we can't yet emulate, and that's the difficulty.

Now, a picture of mountains, with two points marked, A and B. Which one is closer? You all know; it's a flat screen, so the two points are actually the same distance away from you, but somehow baked into your visual system is knowledge about how mountains work, and you know that A must be closer than B. There is a whole ton of tricks we use to determine that, and this is one of the nubs of the problem in robotics and vision, particularly if you use only a single camera. So imagine for now that this really is a three-dimensional mountain range, and we point a camera at it and take a picture.
The camera performs a projection from the three-dimensional world, X, Y, Z coordinates (I'm being careful not to say "zed"), into two coordinates on the image plane, on the photographic paper, whatever. We've crunched three dimensions into two; that's what a camera does. That's rather unhelpful, because for a robot it's really useful to know about three-dimensional space: if I want a collision-free path from here to the door, knowing about space, knowing how far away things are from me, is critically important, and we lose that in the projection.

So all sorts of crazy things happen. Power lines appear to converge, and we take that as normal: we see converging lines and think they must be parallel, unpicking it in our brain. We see a Ferris wheel and think it must be a circle, because we've seen Ferris wheels before, we have some domain knowledge, and we know that circles viewed obliquely appear as ellipses. This is all natural to us, but it's a result of seeing the world as a two-dimensional projection of its three-dimensional self, and it leads to all sorts of other weird effects; we can play tricks.

If we use the common pinhole model to represent this and write the maths for perspective projection, we can take a duck, project it through the camera model, and get an image of the duck. But the issue is that all of these ducks produce exactly the same image. This is the problem with the dimensionality reduction from three dimensions to two: we lose stuff. The size and the distance of objects become inextricably connected, and that's problematic; it leads to all sorts of interesting visual illusions.

Our visual system uses about eight tricks to disambiguate this. It's not just binocular stereo; that's important, but we also use feedback from the muscles that control the lens in our eyes and the vergence of our eyes, information about texture, about the apparent height of objects, about occlusion, to tell us how far away things are. The weighting on these tricks depends a lot on distance: for things in our personal space we use some tricks, and at middle and far distances we use different sets of tricks. We've evolved these over a really long period of time because they have survival advantage. It's useful for us to be able to reconstruct the three-dimensionality of the world: given that we've evolved sensors that crunch three dimensions into two, we've evolved a visual system that can put those dimensions back again, and I think that's quite fascinating.

So why is vision hard for robots? Reason number one is that appearance and geometry are not the same thing. This is not a door; it's a picture of a door, and for a robot those have very different affordances, so you need to be able to tell the difference between a picture of a door and a door. And this is not a hole in the sidewalk that you should avoid if you're a robot; it's a trick drawn by a chalk artist. So we have to disentangle appearance from geometry: geometry is what robots need for planning, and appearance is what a camera senses and gives you.
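Here is a worked sketch of the perspective projection just described, illustrating why all the ducks produce the same image: projection divides by depth, so size and distance become entangled. This is my illustration, not something from the talk's slides.

```python
# A worked sketch of pinhole (central) projection: a tall duck far away and a
# short duck close up land on exactly the same image points, so a single image
# cannot separate size from distance.
import numpy as np

def project(points_xyz, focal_length=0.01):
    """Central projection: (X, Y, Z) -> (f*X/Z, f*Y/Z) on the image plane."""
    points_xyz = np.asarray(points_xyz, dtype=float)
    return focal_length * points_xyz[:, :2] / points_xyz[:, 2:3]

# A 'duck' one metre tall at two metres away...
near_duck = np.array([[0.0, 0.0, 2.0], [0.0, 1.0, 2.0]])
# ...and a duck twice as tall at twice the distance.
far_duck = np.array([[0.0, 0.0, 4.0], [0.0, 2.0, 4.0]])

print(project(near_duck))   # both ducks project to the same image points
print(project(far_duck))
```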
Another reason vision is hard for robots is that very different-looking images can be the same place, and place is an important construct for a robot: a robot needs to know that it is at a particular place, at the front of a lecture hall, in my apartment, or somewhere else. The appearance of a place can vary, and here are some recent examples. These are the same place. In the first column, what's changed is that the lighting has gone from day to night and it's raining, so the atmospheric conditions are adverse. The next case is just a lighting variation: the sun is in a different part of the sky, there are now shadows across the road, and it looks quite different; if you apply a feature detector to this it will go crazy on all the edges and corners of the shadows on the ground, and shadows are not meaningful, they don't change the geometry of the place at all. The last one is interesting: what's changed is the material the scene is made of. It used to be dirt and grass; now it's mostly snow. The material is different, but it's the same place, and the lighting has changed as well. This is all very confounding for a robot.

So I use this kind of "photo equation" to stress the complexity of image formation. The image is a very complicated function of the geometry of the scene (where things are, what stuff is made of), of where I am and which direction I'm looking, of the lighting (where the lights are) and the atmospheric conditions (is it raining, foggy, whatever), and then there's noise and distractors: noise in the sensor, an insect flying past, exhaust from a vehicle; things that are really difficult to model. Roughly, image = f(geometry, materials, viewpoint, illumination, atmosphere) + noise.

Now, that equation we can simulate; computer graphics people have been doing it for seventy years, and you can trace computer graphics back to the fifties. If I know all that stuff, where everything is, what it's made of, where the lights are, I can simulate what a camera would see; I can synthesize an image. I think of this as the forward problem: given a model of the world, I can say what it looks like.

Computer vision and robotic vision people are the idiots who want to do the opposite thing: we want to take an image and invert it, to back out where everything is, what it's made of, and where I'm standing. This is ill-posed; there is no inverse for this function. Yet that's the task we have set ourselves, so the only way we can solve it is to add constraints: we need assumptions, world knowledge, experience, additional information, different senses, more camera views, context, and anything else you can think of; throw it in there to try to come up with a solution. And I think what's interesting is to look at the progress that's been made in
learning applied to vision. Deep neural networks, convolutional networks, can take a single image and recover the three-dimensional structure from it, which is what you did when you looked at the mountain and told me that A is closer than B; neural networks can do that too. I think they do it because they get experience by being trained on a bazillion images of the real world. If you plot every possible image in a space with a huge number of dimensions, the set of all possible images is not uniformly distributed: there is a manifold on which physically possible images exist, and I think that's what the networks are capturing in some way.

So experience is really critical, and we get experience just by living in the world. We know that some things are impossible: we know there are no little people who can stand on a finger (this is a normal-sized person at a huge table), and we know that people don't change size as they walk across a room. So when we see these things, some mental dissonance happens; we reason about it and figure out that this is not a normal room, it's an Ames room, the furniture is giant, and someone is playing a trick on us. This is the experience we have about the real world, which is critical to unpacking an image and understanding what's going on in it. So that's the first reason vision is a difficult sense for robots to use.

Another reason is that it's still ridiculously difficult to get good images under all circumstances. If you're moving quickly in low light you need a long exposure or a big aperture, and then you get motion blur or defocus. Cameras with the dynamic range that we have are still very, very expensive: we've got something like a twenty-bit dynamic range in our eyes, which is phenomenal, and we don't have twenty-bit cameras yet, but they will come. The other issue that makes vision hard for robots is intra-class variation: these cups all look quite different, different types of cups, some full, some empty, seen from different viewpoints, but any human could look at them and say in an instant, that's a coffee cup, that's a teacup. That is still very difficult for algorithms to do.

So we know that vision is not just possible, we do it all the time; it's very useful and it's very possible. The issue is not cameras: what we have in our heads is effectively a 120-megapixel camera with a twenty-bit dynamic range and three colour channels. You can't buy 120-megapixel cameras easily, but it gets boiled down at the back of your retina to about a twenty-megapixel image, and that's what goes to your visual cortex, so cameras really aren't the issue. We've also got accelerometers and gyros in our heads which help us resolve things like: is the world moving, or am I moving? That can be ambiguous, and the inertial sensing in our heads helps us resolve it, just like the sensors in so many devices we carry. The kicker is our vision engine: it's essentially the back third of your brain, about five hundred grams of grey matter, and it consumes about six watts, which is impressive. We can do cool things with GPU
clusters, but they use a lot more than six watts, many orders of magnitude more. So this is what we have to develop; this is the challenge for us as roboticists and robotic vision people.

One thing I'd like to stress here is that seeing is more than just taking a picture and beating it to death with an algorithm. Seeing is a very rich set of things that we do. It involves memory: to process a scene I use all my experience of similar scenes to help understand it. There's a strong sense of context: there's a whole class of things I would expect to see in this classroom that I would not expect to see if I were driving my car, and I would not expect to see a car or a motorbike or a traffic light or a gorilla in here, so I'm not looking for those. I have expectations about what's in the room; they apply constraints to the image, and that helps me understand it. So experience and knowledge are really helpful in understanding images: I use the image to help me establish what context I'm in (I'm in a classroom), and I use the context of the classroom to help me understand the imagery. It's an interesting kind of cycle. The same goes for action and vision: I use my vision to help me reach down and pick something up, I use my eyes to guide my hand; but also, if there's an ambiguity, say I want to see what you've got on the desk in front of you, I'll move my body or my head without even thinking about it to resolve that ambiguity. You probably know what's behind you; you can imagine what's behind you, and that visual memory is almost as vivid as what you're seeing with your eyes. So there is a lot of complex stuff going on, and in robotics we don't tap into even ten percent of the tricks we've evolved for understanding imagery. I actually think we make life hard for ourselves by not using these tricks, by not using attention, not using context, and so on.

That ends my sermon about why vision is hard for robots. I want to talk briefly about what I think are important differences between the research communities of vision and robotics. These are two very big research communities. I identify with the robotics research community; I go to the robotics conferences and only very occasionally a vision conference, and I've got colleagues who go to both. Looking at both communities, what we notice is that the vision community uses a very standard set of benchmarks and test methodology, and they compare their algorithms quite rigorously with other people's algorithms; they publish tables of bold numbers. We kind of deride the people with the tables of bold numbers, but you have to say that their methodology, using a consistent test set and being able to quantitatively compare your results with other people's, has led to enormous progress. In robotics we make a video: I built my robot, I hacked it and tweaked it, I took a video of the one time it worked, and I published that. That is not science; that's an anecdote, a story you tell:
"Auntie, I made a robot that works; look, I took a picture of it." It is kind of sad that there is this distinction between robotics and vision, and you can argue with me later about this, but I think there is also a flaw in what the vision people do. They use a standard dataset, an ImageNet sort of thing: you put an image into a classifier and you get a score. Did it say it was a cat? It was a cat, you get a point. Or, was cat in the top five things the network suggested? It was number three, you still get a point; it depends on your metric. But these images are not correlated in any way. No robot is going to get a picture of a dog, then a cat, then a coffee cup; that just doesn't happen. A robot gets a picture of the world, it moves a bit, it gets another picture of the world, which is mostly similar to the picture it had before. So this methodology is not very useful for a robot.

What happens is that the image dataset fills up with images that are easy to get: you download tourist pictures of things from around the world, so you only get a limited set of viewpoints, and you don't get pictures of unusual circumstances, like a forest that's burning, because those pictures are difficult and dangerous to take. So there's a big bias in the datasets the vision people are using.

But here is the kicker, I think. If I build a robot with a camera on it, put it on the ground at a particular spot and push the go button, it takes a picture, takes an action, takes another picture, takes an action. I can never repeat that: I can never get the initial conditions the same, I can never get the initial images lined up to a pixel, and there will be noise on the sensor anyway, so I will never take the same initial picture. That means the first action will be different, which means the next picture will be more different, and so it goes. So I can't replicate a single experiment. I can build my robot and write some code; you could duplicate my robot, duplicate my code, set it up here, and it would behave differently. How can we do science if we can't conduct rigorous, repeatable experiments? I think that's a problem for us, and I'll talk in more detail later about ways we can overcome it.

What I'll do now is talk about some of the research themes in my lab; a lot of what I'm talking about is the work of others. One of the oldest research topics in the lab is what we call robust place recognition, and I've touched on this already: it's really important for a robot to know what place it's at, but the appearance of a place changes dramatically depending on the time of day and where you're looking from. Looking at these examples, the top one is reasonably easy to tell they are the same place: at night the sky is dark, but some things that are dark in the day are bright at night, and vice versa. In this one the viewpoint is quite different, we're much further away from the buildings on the right-hand side, but it is the same place, and the lighting conditions have changed. My colleague Michael Milford has done a lot of work on a technique he calls SeqSLAM.
This is an example of SeqSLAM. We have two image sequences, driving along a road in the daytime and at night, and we can figure out, for a particular frame in the daytime sequence, the corresponding frame in the night-time sequence, and blend effortlessly backwards and forwards between the two. The algorithm is very simplistic, but it's very, very robust.

I'll talk briefly about how it works; what's the trick here. What it's doing is whole-frame matching: it takes the whole daytime frame and the whole night-time frame, reduces the resolution, enhances the contrast a lot, and computes a simple sum-of-absolute-differences similarity measure. On its own that's not a very robust similarity measure, so what it does is look at the sequence of similarities over time: from a particular point in the sequence, look back over the last few seconds; that slab of data should look similar to a similar-length sequence taken at another time of day. It builds a matrix of the query frames (the new frames) against the frames seen before; the matrix contains the similarities of all of these pairs. The key is that where the matrix is dark, the query frame matches a frame seen before, and a dark diagonal streak means there's a whole sequence that matches, which helps disambiguate. This one here is an outlier, probably not a sensible match, but this dark streak here is a sensible match. So we're using temporal information as well. I should say that what you're seeing is the contrast-enhanced imagery. We also do other tricks, like shifting the images left and right, and changing the scale of the images, to find the best possible match.

Once we've done that, we can match image sequences not only from daytime to night-time but also across different scales: if I'm looking out the left-hand side of my car in lane one and then I move across three lanes, everything shrinks down a little bit, and by computing the images at different scales I can find the scale at which the two sequences are most similar. So we're matching over quite a big search space now, changing scales and changing offsets, but computationally it's actually quite tractable. Once we can do all of that, I can show another video like the one earlier, but now we're going not only from daytime to night-time; we're also going backwards and forwards along the same road, and the scale is changing, and again we can transition quite effortlessly between the two image sequences.

We can take it a step further. If we're in an environment with a branching road network, I'm moving along and I come to an intersection, and I know that after the intersection I can't be in all possible places; I can only have gone down one of N possible roads. So I can use a particle filter framework to guide the search, make it more directed and more efficient, and work out where I am with respect to imagery I've seen before. And we can extend this to panoramic images, where you've got more pixels to look at.
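Here is a minimal sketch in the spirit of the whole-frame matching just described, much simplified from the real SeqSLAM: heavily downsample and contrast-normalise each frame, build a matrix of sum-of-absolute-differences scores, and prefer matches supported by a whole aligned sequence rather than a single frame. The fixed alignment (a slope of one) and the sizes are my assumptions.

```python
# A minimal, simplified sketch of SeqSLAM-style whole-frame sequence matching.
import numpy as np

def preprocess(frame, size=(32, 64)):
    """Downsample heavily and normalise contrast so day/night frames compare.
    Assumes the frame dimensions are integer multiples of the target size."""
    small = frame.reshape(size[0], frame.shape[0] // size[0],
                          size[1], frame.shape[1] // size[1]).mean(axis=(1, 3))
    return (small - small.mean()) / (small.std() + 1e-6)

def difference_matrix(query_frames, reference_frames):
    """D[i, j] = sum of absolute differences between query i and reference j."""
    q = np.stack([preprocess(f) for f in query_frames])
    r = np.stack([preprocess(f) for f in reference_frames])
    return np.abs(q[:, None] - r[None, :]).sum(axis=(2, 3))

def best_match(D, query_index, seq_len=10):
    """Score each reference frame by summing D along a short aligned sequence
    ending at (query_index, j); a low score over the whole slab beats a single
    coincidentally similar frame. Requires query_index >= seq_len - 1."""
    scores = []
    for j in range(seq_len, D.shape[1]):
        idx_q = np.arange(query_index - seq_len + 1, query_index + 1)
        idx_r = np.arange(j - seq_len + 1, j + 1)
        scores.append(D[idx_q, idx_r].sum())
    return seq_len + int(np.argmin(scores))
```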
There are lots of places where GPS is not going to work well for you: in an urban canyon (if you're driving through New York you're not going to see many satellites), in a tunnel, in a forest where the leaves are overhead, or in a mining environment, down a deep hole in the ground, where again you can't see many satellites. So GPS isn't as robust as many people think it is; it's a great sensor when it works well, but there are lots of places where it does not work well.

Another, more sophisticated, approach to place recognition is to use learning to recognize particular features: in this case the network is recognizing particular buildings, landmarks, structures which are invariant in the scene, and using those to help figure out where it is and whether it's at the same place or not. Semantic classification is something that has come a long way in recent years: I can take an image, put it through a semantic classifier and label, pixel by pixel, what's there: ground, tree, person, building, whatever. This is really rich, because if I know a pixel is ground, that's not very useful for localization; ground is everywhere. Same with sky: sky is everywhere, not useful. A car is not useful for localization because cars move, and the same goes for people; for localization I want landmarks, things that are invariant. Buildings or trees are good landmarks. So by being able to look at an image and classify what things are there, I can reason about them and say: this would be a good landmark, it's going to help me decide robustly where I am. Another thing we can do with semantic classification is make inferences about how things will change their appearance. If I know something is a street lamp, I know it will be bright at night and dark in the daytime; same with a window, which might be bright at night and dark in the daytime. So if I know what things are and what their purpose is, I can reason about them and come up with better landmarks for localization.

This is some recent work by one of the students in the lab, Lachlan. It's a SLAM system, and lots of people are working on SLAM systems, but what he's trying to do is connect bounding boxes, which computer vision people love (run a video, put labelled bounding boxes on it, say what's in each box), to the three-dimensional real world. How do I connect this stream of bounding boxes into a representation of solid objects in the world? He's modelling objects using quadrics, which for things that are rectangular, like table tops, are not a perfect fit, but they're not a bad model, and they have some mathematical conveniences when it comes to fitting. So from the bounding boxes he infers these quadrics as a representation of the three-dimensionality of the objects, and once you have those you can predict what the boxes should be and compare that with the incoming data stream. It's really raising the level of abstraction we use for SLAM: it's above pixels and above bounding boxes; it's about things we can reason about in the physical world. So that's a quick tour of some of the things happening in the area of vision-based localization.
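Before moving on, here is a minimal sketch of the semantic landmark filtering idea described above: keep the pixels whose semantic class makes a stable landmark and mask out the rest. The class numbering is made up for illustration.

```python
# A minimal sketch: turn a per-pixel class map from a semantic segmentation
# network into a mask of pixels worth using as localization landmarks.
import numpy as np

CLASS_NAMES = {0: "ground", 1: "sky", 2: "car", 3: "person", 4: "building", 5: "tree"}
GOOD_LANDMARKS = {"building", "tree"}   # static and distinctive; ground/sky/cars/people are not

def landmark_mask(label_map):
    """Return a boolean mask of pixels belonging to stable landmark classes."""
    good_ids = [i for i, name in CLASS_NAMES.items() if name in GOOD_LANDMARKS]
    return np.isin(label_map, good_ids)

# A fake 4x6 segmentation output, as if from the network.
labels = np.array([[1, 1, 1, 1, 1, 1],
                   [4, 4, 5, 5, 2, 2],
                   [4, 4, 5, 5, 2, 2],
                   [0, 0, 0, 0, 0, 0]])
print(landmark_mask(labels).astype(int))
```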
An area I'm very interested in, and this is work from some of my PhD students, is the vision front end. There's a whole range of interesting imaging devices available to us now that are better than what we have built into our heads. We have cameras with very high ISO ratings: 400,000, and there are cameras now with a 2.5 million ISO rating, which means they can work in almost complete darkness. There's 360-degree vision, and there are multi-aperture or light field cameras, which I'll talk about in a moment, which have some particular advantages.

There are lots of times of day, and parts of the world, where the light levels are low. Our eyes can adapt chemically to those levels, but in the robotics community we don't really talk much about low-light conditions. As my colleague Paul Newman says, it's dark exactly half the time if you're outside, and you want robots that can work twenty-four/seven and not have to go to sleep when it's dark. Here are some cameras: a video camera from Canon, this one running at 4.56 million ISO, and that's a picture taken in almost complete darkness; and this is the Sony A7S, and we have one of those in our lab.

Now, if I take that into a very dark room and point it at a chart, a blue tile and a standard Macbeth colour checker, you see it's got some colour speckle on it, which is interesting: it's a blue tile, but it has a whole bunch of coloured texture that really shouldn't be there. It's not blue, and it's speckled. So you think: what's going on here? If you look at what happens inside a modern digital camera, a lot happens between the sensor and the image being put onto the memory card. One of the first steps is what's called debayering: you take the image, which has a patterned colour filter array over it, and convert it to a full colour image. There are lots of algorithms out there for debayering: take the colour mosaic, convert it into sparse red, green and blue images, and then interpolate the red, green and blue values in all the gaps. There's a ton of literature and a ton of algorithms for doing it, but basically a whole lot of arithmetic happens between adjacent red, green and blue pixels, and that all works well.

But when you work at very low light levels, the noise distribution from the sensor is no longer Gaussian. We make lots of assumptions about noise being Gaussian; at low light levels the distributions are Poisson. The distribution is one-sided, because you can't have a negative intensity, so the peak bunches up against the axis. Now, in the debayering process, if you add two Poisson-distributed values together you get something that is still Poisson distributed, but the difference of them is not; it's a funny thing called a Skellam distribution. So if your assumption is that everything is Gaussian and you add and subtract values expecting the distribution to keep its shape, you are misled. If we take that into account, we can recreate the colour histogram we see here, by modelling much more accurately the noise distribution and the debayering process.
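Here is a rough sketch of the kind of low-light simulation being described, assuming a simple Poisson-plus-read-noise sensor model (my assumption, not the lab's exact model): photon counts are one-sided and non-Gaussian, and differences of neighbouring pixels, which is the sort of arithmetic debayering does, follow a Skellam rather than a Gaussian distribution.

```python
# A minimal sketch of simulating a low-light capture and examining its noise.
import numpy as np

rng = np.random.default_rng(0)

def simulate_low_light(clean, photons_at_full_scale=20.0, read_noise_e=2.0):
    """Turn a clean linear image (values in [0, 1]) into a noisy low-light capture.

    Photon arrivals are Poisson (shot noise); read noise is added as Gaussian
    electrons. Returns an image back in [0, 1] units."""
    expected_photons = clean * photons_at_full_scale         # mean photon count per pixel
    photons = rng.poisson(expected_photons).astype(float)    # one-sided, non-Gaussian
    electrons = photons + rng.normal(0.0, read_noise_e, clean.shape)
    return electrons / photons_at_full_scale

# A flat mid-grey patch: every pixel has the same true value.
clean = np.full((256, 256), 0.3)
noisy = simulate_low_light(clean)

# Differences between adjacent pixels (the kind of arithmetic debayering does)
# are differences of Poisson variables, i.e. Skellam distributed, not Gaussian.
diff = noisy[:, 1:] - noisy[:, :-1]
print("difference mean", diff.mean(), "std", diff.std())
print("skewness of raw values:", float(((noisy - noisy.mean()) ** 3).mean() / noisy.std() ** 3))
```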
That's interesting, because it tells us that the reason we get these colour artifacts is the debayering process. It also enables us to take a bright image that I captured of the world and simulate what it would look like at low light levels through one of these cameras. We can simulate the noise statistics, and they work quite well: this is the noise statistic of the real camera in low light, and this is the statistic of the simulated low-light imagery. It's much easier to take existing images and convert them into fake low-light images than it is to go into a dark room and capture low-light images. If I want to train a neural network, I could take all the ImageNet images and turn them into synthetic low-light images and then train the network; there's no way I could recapture ImageNet in my darkroom, it would take a very long time. Now that we can do that, we can look at how something like a regular feature detector, the BRISK feature detector in this case, responds to these different debayering algorithms, and we can see that bilinear debayering with gradient correction gives far better results than some of the other algorithms. Maybe this is small news, not exciting to many people, but I think there's a whole area here: how does the canon of computer vision algorithms we've come up with respond to low-light images? It's a whole class of imagery no one has looked at before, and we just need to develop some tools that let us explore which algorithms are most robust to these low-light images.

Another area that's interesting to me is shiny and transparent objects; my student Dorian has been working on this. They're much more common than we think, and it's a worry; I think we just pretend they're not there. You've probably experienced it yourself: with a shiny, reflective thing, as you move your head a little bit, the reflection moves in a different way to the rest of the object. That's a typical characteristic of a specular reflection from the surface of an object. And if it's a curved refractive object, like a glass, as I move my head, things seen through it move quite differently to the rest of the background. So the question is: can we use that information to work out whether something has been refracted or not? Rather than physically moving the camera on a robot, which we could do, we're interested in this whole area of light field cameras. Has anyone here played with a light field camera? Good. A light field camera lets you capture not just the light rays that fall through the pinhole, the aperture of the camera, but basically all the light rays that come through a window.
This room is full of rays of light bouncing around everywhere. A light field camera lets you capture all the rays that come through a window rather than through a hole, and once you've got that, after the fact you can move the camera around, adjust which rays you're looking at, change the focus; you can do all sorts of really cool things. A way to think about a light field image is as an array of images captured by an array of cameras: we've got a grid of cameras, each of them takes a picture, and we end up with a grid of images. It's a two-dimensional array of cameras and a two-dimensional array of images, which is why it's called a four-dimensional light field. The simple explanation of a light field image, then, is a whole lot of images all taken from slightly different points of view. Here's a normal image taken with a regular camera, a three-by-three light field image, and a seventeen-by-seventeen light field image, which is what we can capture with the big, chunky Lytro Illum camera: 289 distinct viewpoints captured simultaneously when I click the shutter, and that's very powerful.

Now, with that light field captured, I can effectively move the camera backwards and forwards after the fact; here is a glass sphere on a bunch of tarot cards. I can then take what we call an epipolar plane image slice: I can look at how the horizontal coordinate of a pixel varies as I move the camera across in the horizontal direction, and that gives me a bunch of sloped lines, where the slope is related to distance. In this particular one the slopes here are different to the slopes there, because these pixels are closer to us than the ones out in the red box. I can do a similar thing vertically. So these epipolar slices contain really rich information: the slope tells me something about how far away those pixels are. But if the pixel in the middle here has been refracted, then as I move my viewpoint the epipolar lines are no longer straight; they're curved. That's the nature of refractive optics. So if I look at these epipolar slices and classify each line as straight or curved, I can say whether that pixel is being viewed directly or through a piece of glass; I can say whether that point has been refracted or not. In this example, a picture of mountains (a recurring theme) with a glass cylinder in front, the points painted red are ones we deem to have been viewed through glass and refracted, and the blue ones we call Lambertian features, where the light is scattered in the normal way. That's the experimental setup: glass in front of a picture of mountains.

Why is that useful? Consider a classic computer vision algorithm, structure from motion: I'm moving a camera through the world and trying to figure out where the camera is. If there are points in the world that do not move in the way you would expect them to, they introduce noise and error into the structure-from-motion estimate. So if I can identify the points that have been refracted, the ones that are going to mess up the structure-from-motion algorithm, and filter them out, we end up with a much more robust structure-from-motion estimate, having eliminated all the feature points behind that piece of glass, which is a useful thing.
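Here is a minimal sketch of the epipolar plane image idea, my illustration rather than the student's code: a Lambertian point traces a straight line across the sub-aperture views, with slope related to depth, while a refracted point traces a curve, so a poor straight-line fit flags refraction.

```python
# A minimal sketch: classify a point's track across light-field views as
# straight (Lambertian, slope encodes depth) or curved (likely refracted).
import numpy as np

def classify_epi_track(view_index, x_position, residual_threshold=0.5):
    """Fit a straight line to a point's horizontal position across views.

    Returns (slope, is_refracted). The slope is the disparity per view step,
    related to inverse depth; a large residual from the line fit suggests the
    point is being viewed through a refractive object."""
    v = np.asarray(view_index, dtype=float)
    x = np.asarray(x_position, dtype=float)
    slope, intercept = np.polyfit(v, x, 1)                        # best straight line
    residual = np.sqrt(np.mean((x - (slope * v + intercept)) ** 2))
    return slope, residual > residual_threshold

views = np.arange(9)

# A Lambertian point: position shifts linearly with viewpoint (straight EPI line).
lambertian = 100.0 + 1.8 * views
print(classify_epi_track(views, lambertian))      # small residual, not refracted

# A refracted point: the track bends as the viewpoint moves.
refracted = 100.0 + 1.8 * views + 0.4 * (views - 4) ** 2
print(classify_epi_track(views, refracted))       # large residual, flagged as refracted
```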
That's the light field camera on the end of a robot arm; you can see it's quite a big, chunky camera, and quite heavy. Another thing we can do with a light field camera is to reformulate a favourite algorithm of mine, image-based visual servoing, which I hope some people in the room are familiar with. Basically, we take errors in the image plane and put them through the inverse of a Jacobian matrix to figure out the spatial velocity the camera should have. With a light field camera we get not only the coordinates of points but also depth (inverse depth) information, which we can pull out of the camera directly because of the multi-view information. So we can reformulate visual servoing for a light field camera, and it behaves just like regular image-based visual servoing, with the same robustness and convergence characteristics, which is very helpful, but it also has great robustness to occlusions.

When we did this work we wanted a camera we could use for visual servoing, which means a video light field camera. The big Lytro Illum only takes snapshots and writes them to a card, so I can't use it in the loop to control a robot. So we built a light field camera, and the easiest way to do that was to embed nine mirrors into a 3D-printed housing, so that each mirror gives a virtual camera. Up at the top is what the captured image looks like; we chop the individual images out, and we've now got a three-by-three light field camera. We can then compare light field image-based visual servoing with a stereo image-based visual servoing algorithm; the paper was published in Robotics and Automation Letters last year. What's really powerful about this is when there are occlusions between the camera and the target: because there are so many individual virtual cameras looking at the scene, one of them might be occluded, but they are never all going to be occluded, so it has great robustness compared to a regular perspective camera. If you're looking at a scene through a wire mesh fence, you can just move your head a little bit to move a piece of wire out of your field of view; with nine eyes looking at the scene, there will always be some eyes with an unobstructed view. That's very powerful. There's a rough sketch of the underlying visual servoing update below.
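Here is a minimal sketch of the classical point-feature image-based visual servoing update that the light field version builds on (this is the textbook formulation, not the light field reformulation itself): image-plane error goes through the pseudo-inverse of the stacked interaction matrices to give a camera velocity command. The feature positions, goals and depths below are made-up numbers.

```python
# A minimal sketch of classical image-based visual servoing (IBVS).
import numpy as np

def interaction_matrix(x, y, Z):
    """2x6 interaction matrix for one point feature at normalised image
    coordinates (x, y) with depth Z; a light field camera can supply Z directly."""
    return np.array([
        [-1.0 / Z, 0.0,       x / Z, x * y,      -(1 + x**2), y],
        [0.0,      -1.0 / Z,  y / Z, 1 + y**2,   -x * y,      -x],
    ])

def ibvs_velocity(features, goals, depths, gain=0.5):
    """Stack the per-feature Jacobians and solve for the 6-DOF camera velocity
    (vx, vy, vz, wx, wy, wz) that drives the features towards their goals."""
    L = np.vstack([interaction_matrix(x, y, Z) for (x, y), Z in zip(features, depths)])
    error = (np.asarray(features) - np.asarray(goals)).reshape(-1)
    return -gain * np.linalg.pinv(L) @ error

# Three observed features, their desired positions, and (assumed) depths.
feats = [(0.10, 0.05), (-0.08, 0.12), (0.02, -0.10)]
goals = [(0.00, 0.00), (-0.10, 0.10), (0.05, -0.05)]
print(ibvs_velocity(feats, goals, depths=[1.2, 1.0, 1.5]))
```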
Simulation. I talked earlier about the unrepeatable experiment, and I invite argument about this later, but if we can't repeat an experiment we can't put robotics or robotic vision on a strong scientific foundation. So what can we do? I think one answer is simulation, and simulation has come a long way. This is some photorealistic simulation my student John Skinner has done using the Unreal Engine, which is a really popular platform. He built a virtual world, and we can place the camera at an arbitrary pose in that world, take a picture, put it through our algorithm, move the camera, and repeat. So we can run robotic vision experiments entirely repeatably in the simulation environment, and I can do things like (this might be very difficult to see) change the camera focus settings, change the lighting conditions, all sorts of things. We can take our algorithm and torment it in a very repeatable way: we can ask how far I can turn the light down before my visual navigation algorithm breaks, which is very difficult to do in the physical world.

Some of the work here was to imagine a camera driving along a road; I'll let the video run, it's probably self-explanatory. We run a camera along a virtual road at a number of different offsets from the middle of the road (along the middle, a bit to the left, a bit to the right), and we turn the sun up and down. Then we take the images we've generated, put them into localization algorithms like SeqSLAM or a SLAM system, and see how well they do, given that I've changed the viewpoint and the time of day. It's quite a big experiment in some ways, with a lot of variables, and we can summarize it in this way, as an F1 score (from precision and recall) for SeqSLAM. We get this really strange result: we can see that the error depends on where the camera is with respect to the middle line of the road and on the time of day. Even I still don't understand what that error surface means, but there is an error surface and it's not constant: it depends on where you are and on the time of day, and you could not have got this data in a physical experiment. We don't know whether it is entirely an artifact of the simulation or whether it's real, but it's interesting. Here is the same thing for another algorithm, and it shows the error is generally pretty low except when you're in the middle of the road, where it gets worse; again, I've got no idea what that's about, whether it's an error in the simulation or in the methodology.

We started to push this a little further, and this plot is very busy. What we've done is take a simulation environment, a bit like the EuRoC dataset but done in simulation, and run a camera through it, with the camera following exactly the same path every time. We feed those images (monocular, stereo; we can simulate all those different sorts of cameras) into ORB-SLAM and ask it for the path. The result you get is not constant from run to run: every time I run it I get a slightly different result. There's a fair bit of consensus, but occasionally you get an outlying path. And you think: how can that happen? How can I put exactly the same sequence of images into an algorithm and get a different path out? It's RANSAC. The system computes features and uses a lovely algorithm called RANSAC, which is the best outlier rejection algorithm we have, but it's based on random numbers, and you don't expect that randomness to filter through into the final result. ORB-SLAM is very widely used, and I don't know anybody who has stumbled across this before.
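Here is a minimal sketch, not ORB-SLAM itself, of why a RANSAC-based pipeline can give different answers on identical input: the hypothesis sampling is random, so the chosen inlier set, and hence the model, can change from run to run unless the random number generator is seeded.

```python
# A minimal sketch: RANSAC line fitting on identical data can give different
# results run to run unless its random number generator is seeded.
import numpy as np

def ransac_line(points, iterations=20, threshold=0.1, rng=None):
    """Fit y = a*x + b to 2D points with RANSAC; returns (a, b, inlier_count)."""
    rng = rng or np.random.default_rng()
    best = (0.0, 0.0, -1)
    for _ in range(iterations):
        i, j = rng.choice(len(points), size=2, replace=False)    # random minimal sample
        (x1, y1), (x2, y2) = points[i], points[j]
        if abs(x2 - x1) < 1e-9:
            continue
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = np.sum(np.abs(points[:, 1] - (a * points[:, 0] + b)) < threshold)
        if inliers > best[2]:
            best = (a, b, int(inliers))
    return best

# The same data every time: a noisy line plus gross outliers.
data_rng = np.random.default_rng(42)
x = np.linspace(0, 1, 50)
pts = np.column_stack([x, 2.0 * x + 0.5 + data_rng.normal(0, 0.02, 50)])
pts[::7, 1] += data_rng.uniform(1, 3, len(pts[::7]))             # outliers

print(ransac_line(pts))                                # unseeded: may differ between runs
print(ransac_line(pts))
print(ransac_line(pts, rng=np.random.default_rng(0)))  # seeded: repeatable
```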
We only found it because, I guess, we were stubborn enough to put the same data through the system time after time after time to see what would happen, and we got this. We spent ages debugging; we thought it was a timing problem, a thread problem, we tried all sorts of things. It's a characteristic of the SLAM system which we would not have seen had we not done the experiments in this sort of photorealistic simulation, so I think the methodology has some merit.

Then we think, OK, if we're going to do this visual simulation, how realistic does it need to be? We can have very realistic pictures, shown along the top, and less realistic ones along the bottom; the bottom ones are cheaper to compute. So how realistic does it need to be? That's a hard thing to figure out, so we've just started to experiment: put these image sequences, high-fidelity ones and low-fidelity ones, through some different algorithms, and look at the error, accuracy and precision measures to see if there's a difference. There is some difference, there is some signal there, but we still don't quite understand what it means. I think it means that better simulation is closer to what we observe in the real world, but it's almost impossible to build a simulation of a real physical environment (to build this room in simulation would be really hard), so while comparing results from the same simulated environment at different levels of fidelity is easy to do, comparing simulation to the real world is very difficult and would probably have to be done over many, many different environments, coming up with some statistics.

A bit more about computer graphics and the things it can do: we can compute not only images, we can compute depth (imagine a depth camera, or a surface-normal camera), and we can have cameras which give you the class of the object at each pixel in the scene. So we can build synthetic worlds, pull out all of this information, and use it to train deep networks. One of the problems with deep networks is getting lots of labelled data, and we can use simulation to provide that labelled data. I like this because I talked before about the forward problem, the computer graphics problem, being essentially solved; we can use it to train networks to solve the more difficult inverse problem, which has a certain nice symmetry about it.

What else can we do with this? We can test the performance of an object recognizer: if I've trained a neural network that recognizes a cup, I can produce images of the cup in a bazillion different orientations, feed them to the network, and ask where it succeeded and where it failed. Green shows the orientations of the cup that the network recognized; these are the orientations where it did not recognize the cup. So we can explore the performance space of a neural network by testing it with simulated data. Now, if I just generate images of cups in isolation there's a danger of overfitting, so we're looking at what we call procedural generation: we create lots of different environments with different cups in them, in different places, under different lighting conditions, which makes a much richer source of data for training deep networks.
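Here is a minimal sketch of sweeping a recognizer over simulated viewpoints to map out where it succeeds and fails, as described above. The render and recognize functions here are dummy stand-ins; in practice they would be the graphics engine and the trained network under test.

```python
# A minimal sketch: evaluate a recognizer over a grid of simulated viewpoints.
import itertools
import numpy as np

def sweep_orientations(render, recognize, step_deg=10):
    """Return a dict mapping (azimuth, elevation) in degrees to pass/fail."""
    results = {}
    for az, el in itertools.product(range(0, 360, step_deg), range(-60, 61, step_deg)):
        image = render(az, el)                 # image of the object at this viewpoint
        results[(az, el)] = bool(recognize(image))
    return results

# Dummy stand-ins just so the sketch runs; swap in the real simulator and network.
fake_render = lambda az, el: np.array([az, el], dtype=float)
fake_recognize = lambda img: abs(img[1]) < 45   # pretend recognition fails at steep elevations

grid = sweep_orientations(fake_render, fake_recognize, step_deg=30)
print(sum(grid.values()), "of", len(grid), "poses recognized")
# The resulting map is what gets plotted as the green/red orientation plot on the slide.
```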
The last thing in this segment is a robotic vision challenge that we're hoping to launch, possibly with some help from Google. The idea is to do for robotic vision what ImageNet did for computer vision. Instead of an image dataset, which is what the computer vision people use, we have an online simulation: you test your algorithm by uploading a pose, and we give you back the image; you upload another pose, and we give you back another image. We can compute performance measures, and we can add things you weren't expecting: change the lighting conditions, introduce people or obstacles, move stuff around. It's a way, I think, that we can finally put robotic vision on a strong foundation of repeatable experimentation, and hopefully that will lead to rapid progress in the field, as we've seen with computer vision.

That brings me to the last section of research in the lab I want to talk about: grasping and manipulation, an area I never thought I would work in. One of my students got really interested in a paper from Google DeepMind (I don't remember exactly when it came out) which was basically about skill capture: they created a deep reinforcement learning system that could play forty different Atari games quite well. We thought, well, that would be really cool: what if it could move a robot towards a target? So we built a graphical representation of a robot, and the simple task was to move the robot to a blue dot. It worked OK in simulation, but as soon as you put it on a real robot it failed miserably, which is probably not surprising. My student Fangyi is a very, very persistent individual; he worked on lots and lots of things and ended up with this architecture. I'm not going to go into the details, but essentially, unlike the Google DeepMind network, this network has two parts, a perception network and a control network, and we train them individually. We train them separately, then stick them together and fine-tune them end to end with some more training. We can train this in simulation and then lift it to the real world, and he has been looking at all sorts of techniques for this transfer from simulation to the real world. So this is a network that takes an image and outputs seven joint velocity commands: image in, robot joint velocities out. We train it for a particular task, which is reaching; it's trying to reach for a blue object, and I'll show you a video of that in a moment.

I guess the big achievement is that we can train on a bazillion simulation images, which is cheap and easy to do, but we want the skill in the real world, with the minimum number of real-world images to train on. So he used a technique called adversarial discriminative transfer. In this plot blue is good; this axis is the number of unlabelled real images (we just take pictures of the environment and the robot, we don't have to label anything), and this axis is the number of images we do have to label. Down here we're labelling of the order of ninety-odd images; here he's labelling just forty-eight images, plus another forty-eight images that we don't label, and the performance is pretty good. There's a rough sketch of the two-part architecture below.
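Here is a minimal sketch, assuming PyTorch and my own made-up layer sizes, of the two-part idea just described: a perception network maps the image to a compact state, a control network maps that state (plus joint angles) to seven joint-velocity commands, and the two parts can be trained separately before being fine-tuned end to end. This is a reconstruction of the idea, not the actual architecture.

```python
# A minimal sketch of a perception network + control network reaching policy.
import torch
import torch.nn as nn

class Perception(nn.Module):
    """Image -> low-dimensional scene state (e.g. target position)."""
    def __init__(self, state_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, state_dim),
        )

    def forward(self, image):
        return self.net(image)

class Control(nn.Module):
    """Scene state plus joint angles -> 7 joint velocity commands."""
    def __init__(self, state_dim=3, joint_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_dim, 64), nn.ReLU(),
            nn.Linear(64, joint_dim),
        )

    def forward(self, state, joints):
        return self.net(torch.cat([state, joints], dim=-1))

perception, control = Perception(), Control()

# Stage 1 (omitted): train each part separately, perception on labelled sim
# images, control on ground-truth states from the simulator.
# Stage 2: stick them together and fine-tune end to end.
image = torch.zeros(1, 3, 64, 64)    # placeholder camera image
joints = torch.zeros(1, 7)           # placeholder joint angles
velocities = control(perception(image), joints)
print(velocities.shape)              # torch.Size([1, 7]): image in, joint velocities out
```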
This is the network running. We're back on the real robot, using a hand camera on one arm of Baxter, and the arm is reaching towards the blue object. It has been taught about clutter: the simulations it learned from contain clutter, and the blue object is the thing it has to move towards. So this is three-degree-of-freedom reaching: image in, seven joint velocities out. It has been quite difficult to get to this point, but I'm pretty excited about the results. This video shows Fangyi tormenting his creation: he just continually moves the blue cube around, and it relentlessly keeps going after it. I think this is a very nice result in deep reinforcement learning.

In 2016 we entered the Amazon Robotics Challenge; has anyone here competed in this? We didn't do brilliantly in 2016, but we nailed it in 2017, which I'm very happy about: we won in 2017. The competition is quite challenging; in 2017 in particular, fifty percent of the objects were ones you had not seen before, and they were given to you thirty minutes before the competition, so if you were using deep learning to recognize the objects, you had thirty minutes to collect the data and do the training. We used a low-cost Cartesian robot with a 3D-printed gripper; this is it in action. It had fingers on one end and a suction cup on the other, and it was able to classify all the objects, figure out where there was a good grasp point, reach in, and pick them up using the appropriate tool. This was quite an achievement, and I'm very happy about it.

What's under the hood is a network called RefineNet, which comes from my colleagues at the University of Adelaide in Australia; it's quite a powerful network for semantic segmentation. The input to the vision system is an RGB image and a depth image, and out comes this image where each object's outline is traced and labelled, along with the probability of what it is. That's really all the information we need for the robot, together with computed surface normals, which are helpful for picking up some sorts of objects. I mentioned the retraining: we put the new objects in a number of different orientations, about five or six each, capture a picture of each, and we had a shelf full of GPU cards which we used to do the training in the thirty minutes before the competition started. These guys got very, very good at doing this retraining: smart students, and a good deep network.

The last thing I want to talk about is from my student Doug, who has been looking at a network that synthesizes grasps directly. At the moment, you typically take an image, do some grasp planning, and then the robot executes the plan. What Doug has built is a network that outputs a grasp plan in less than twenty milliseconds: given an image, it outputs three maps: grasp quality (where you should grasp), how far apart your fingers should be, and what the angle of the gripper should be. Once we can do that in twenty milliseconds, we can do visual servoing in the grasp loop. Let me go quickly through it and show you it working; there's a rough sketch of the idea below.
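Here is a minimal sketch of how the three per-pixel output maps (grasp quality, gripper angle, gripper width) can be turned into a single grasp command quickly enough to sit inside a control loop. This is my illustration of the idea, not the actual network or code.

```python
# A minimal sketch: pick the best grasp from per-pixel quality/angle/width maps.
import numpy as np

def best_grasp(quality, angle, width):
    """Pick the pixel with the highest grasp quality and read off the grasp.

    quality, angle and width are H x W maps as produced by the network;
    returns (row, col, angle_rad, width_m, score)."""
    row, col = np.unravel_index(np.argmax(quality), quality.shape)
    return row, col, float(angle[row, col]), float(width[row, col]), float(quality[row, col])

# Fake network outputs, just to exercise the function.
H, W = 224, 224
rng = np.random.default_rng(1)
quality = rng.random((H, W))
angle = rng.uniform(-np.pi / 2, np.pi / 2, (H, W))
width = rng.uniform(0.0, 0.1, (H, W))

print(best_grasp(quality, angle, width))
# Because inference plus this selection takes only tens of milliseconds, the
# grasp can be re-planned every frame and the arm visually servoed onto a
# moving object.
```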
In the video you can see it visually servoing towards a moving object and completing a grasp. It's doing grasp planning multiple times per second, and that gives it a lot of advantages over the open-loop planning systems out there.

I'm conscious of time, so let me skip some material and go straight to the applications. These are the main research threads in my lab. Five years ago I never thought I'd be doing anything in deep learning (probably nobody thought they would be), and I didn't think I'd be doing anything in grasping; we fell into those somewhat accidentally. And some old tricks like visual servoing have come back: we're now visually servoing onto grasps rather than just in image space, which is interesting. Let me quickly talk about some applications where we put these robotics and vision ideas together to create systems that do useful things in the world.

We've spent quite a bit of time looking at broadacre agriculture, operations like ploughing, weeding or sowing in a field, and the idea is that instead of having one really big agricultural machine go out and do this work, you could get by with more, smaller, cheaper machines: a sort of swarm of robots for farming. That's what we promised the funders, and so we created this first-generation robot, which looks a little comical. It navigates entirely using vision; there's no laser sensor here: obstacle detection and crop row following are all done with vision. This is the pipeline for the crop row following: we take an image, find the horizon, rectify the image so the horizon is level, then synthesize an artificial overhead view, an overhead view with a stabilized perspective, and track a feature, which is pretty obvious in that synthetic overhead view, and we use that to steer the vehicle one way or the other. The other application of vision here is stereo, and we've got a couple of different strategies working together to robustly identify obstacles in the scene: the blue area is where it's looking, and anything red is an obstacle. There are some challenges here. If you're driving through a crop, something that is tall is probably an obstacle, but there could be something the same height as the crop that is also an obstacle; say somebody is lying down in the grass at the same height as the grass: you shouldn't drive over them, and they look different to everything around them. So one strategy looks at local dissimilarity and the other looks at height, and we fuse them together in a way that turned out to be pretty robust. There's no laser scanner; we're really trying to drive the cost down by using cameras, computing, and very low-cost inertial sensors.
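Here is a minimal sketch, with assumed parameters and using OpenCV, of the crop-row-following pipeline described a moment ago: level the horizon, warp to a synthetic overhead view, and track the row offset to produce a steering signal. The horizon angle, ground quadrilateral and colour thresholds are all placeholders.

```python
# A minimal sketch of a vision-only crop-row-following pipeline.
import cv2
import numpy as np

def level_horizon(image, horizon_angle_deg):
    """Rotate the image so the detected horizon is level.
    (Horizon detection itself is assumed to have been done upstream.)"""
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), horizon_angle_deg, 1.0)
    return cv2.warpAffine(image, M, (w, h))

def overhead_view(image, src_quad, size=(400, 400)):
    """Warp a ground-plane quadrilateral (in pixels) to a synthetic top-down view."""
    dst_quad = np.float32([[0, 0], [size[0], 0], [size[0], size[1]], [0, size[1]]])
    H = cv2.getPerspectiveTransform(np.float32(src_quad), dst_quad)
    return cv2.warpPerspective(image, H, size)

def row_offset(overhead_bgr):
    """Steering cue: centroid of green (crop) pixels relative to the image centre."""
    hsv = cv2.cvtColor(overhead_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (35, 40, 40), (85, 255, 255))   # rough 'green' band
    cols = np.where(mask.any(axis=0))[0]
    if cols.size == 0:
        return 0.0
    return float(cols.mean() - overhead_bgr.shape[1] / 2)   # positive: row is to the right

# Example usage (with a real field image and calibrated ground quadrilateral):
# frame = cv2.imread("field.png")
# top = overhead_view(level_horizon(frame, horizon_angle_deg=-3.0),
#                     src_quad=[(100, 300), (540, 300), (620, 470), (20, 470)])
# steer = -0.01 * row_offset(top)   # simple proportional steering
```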
The second-generation robot looks much nicer. It was developed in-house, and underneath it is an array of cameras and weeding implements. The task is to drive over a crop field, recognize individual plants (this is early in the season, when there's not much else growing), and classify each plant: is it a crop or a weed? If it's a crop we leave it alone; if it's a weed we terminate it, and there are two ways we can do that. One is to put poison on it, which is pretty traditional, but because we know exactly where the weed is we can get by with much less poison than is normal in agriculture, where it's sprayed everywhere; very targeted spraying of herbicide has both environmental and financial benefits. That shows the targeted spraying. Or we can mechanically excavate the plant, which looks comical but has a real advantage: we've been putting poison on plants for fifty years, a lot of them are now resistant to it, and these herbicide-resistant superweeds worry the agricultural sector a lot. We hope it will be a long time before weeds evolve resistance to being struck mechanically and pulled out of the ground. The imagery we get from the vehicle rolling over the ground can be stitched together to build a weed distribution map of the field, which is useful for planning purposes: it tells us where we have seen weeds.

Another robot we created is called Harvey. Harvey picks a fruit that I would call a capsicum and I think you call a bell pepper. It's a UR5 with an Intel RealSense camera; it moves around a bit to deal with occlusion from leaves, cuts the stalk, and puts the fruit in a box. It's now about half the speed of a human picker. This is the robot's-eye view, a coloured point cloud, which we use for path planning to come in and grab the fruit. This is an area of agricultural robotics I'm very enthusiastic about. We have a problem in my country with the agricultural labour force: we don't have undocumented workers, we're an island a long way from anywhere, so if nobody can be bothered picking the fruit, we're not going to have it in the shops. So it's nationally important that we can do these things.

The very last thing I want to show you is another robot I'm very proud of. This is a starfish, a thing called the crown-of-thorns starfish. It's a native species in Australia, and it eats coral; after they've eaten, a coral reef looks something like this: it becomes a desert, no colourful sea life, it's all gone. These starfish appear in plague proportions at the moment because we've done very bad things to the local environment through bad agricultural practice. A few years ago a technique was developed where a scuba diver goes along with a needle and some poison and injects each starfish. But the Great Barrier Reef is about as big as Italy, so the idea of divers going along injecting every starfish by hand is mad. We thought a robot could do that, so we built a robot that would terminate crown-of-thorns starfish, and this is some of the trials of the starfish robot
These are some trials of the starfish robot dating back to 2015. Again, here's where deep learning comes into play. The robot was initially pushed around by a scuba diver because we wanted to acquire images of crown-of-thorns starfish — there aren't many pictures of them around, so if you want pictures you have to go and get them yourself, and the best camera we had was on the robot. So we pushed it around, gathered a lot of images, and trained a network to recognise the starfish, which is quite distinctive. This is the classifier running: when it sees a starfish it gets excited. When it sees one, the robot stops and holds position, works out using stereo how far away the starfish is, computes an inverse kinematics solution, reaches down and injects it. This trial is over sand, and in a moment it will reach down and inject this starfish. We thought this was pretty cool; we've done a lot of testing and it works, but we can't get anybody motivated to take it any further. I guess it's not quite clear who gains financially from deploying this robot: everyone has something to lose if the reef dies, and sadly it will die — this would only postpone the end a little — but finding a business model to make it work has escaped us so far. And with that I'm going to stop.

I should say that all of the work I've talked about was done under our robotic vision research centre: a multi-university partnership funded by a seven-year research grant of twenty-six million Australian dollars, of which I'm the director. There are four Australian universities and a bunch of international partner universities, one of them Georgia Tech — it's because of Frank that Georgia Tech is involved. Because Georgia Tech is a partner, you're welcome to come and visit. In February, the middle of your winter, we go to a place by the beach and hold a summer school, which is really, really nice — this is one of the summer schools; there are kangaroos and beaches, biking and hiking and all sorts of things. If you're interested in coming to the summer school, hanging out in any of our labs, or working on any of our projects, talk to your advisor. Thank you.

[Audience question about the sensing on the agricultural robots.] It's pure computer vision with RGB imagery. We've done some other stuff with infrared as well — near infrared, which is very good at telling plants from soil. You're welcome, yeah.

[Audience question.] OK, so the question was about more detail on the neural network that synthesizes the grasps. I don't know a whole lot about it; there was an RSS paper this year, so check out the RSS paper by Doug Morrison. Essentially what he did was take a grasping dataset — the Cornell grasping dataset — and train a network so that, given an input image, out come three images: where you should grasp, the orientation of the grasp, and the distance. I'm not sure what the loss function was. The inference is very, very quick, and you then just find the best possible feasible grasp and go for it. How you'd switch between different possible grasps I don't know; I'd suggest reading the paper.
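Since the question was about how the grasp network's outputs get used, here is a rough sketch of the selection step just described: the network (treated as a black box here) produces three maps — grasp quality, angle, and width/distance — and you take the best-scoring pixel that passes whatever feasibility check the robot imposes. The maps below are random dummies standing in for the network's outputs; this is not the code from the RSS paper.

```python
# Rough sketch of "pick the best feasible grasp" from three per-pixel maps.
import numpy as np

def best_grasp(quality, angle, width, feasible=lambda u, v: True):
    """Return the highest-quality pixel grasp that passes a feasibility test."""
    order = np.argsort(quality, axis=None)[::-1]      # best pixels first
    for flat_idx in order:
        v, u = np.unravel_index(flat_idx, quality.shape)
        if feasible(u, v):                            # e.g. reachable, collision-free
            return {"pixel": (u, v),
                    "angle": float(angle[v, u]),
                    "width": float(width[v, u]),
                    "quality": float(quality[v, u])}
    return None

# Usage with dummy maps standing in for the network's three output images:
h, w = 300, 300
quality = np.random.rand(h, w)
angle = np.random.uniform(-np.pi / 2, np.pi / 2, (h, w))
width = np.random.uniform(0.0, 0.15, (h, w))
print(best_grasp(quality, angle, width))
```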
You're welcome to keep these; I can give you a PDF of this talk, and a lot of the videos and the sources cited are referenced in there, so you can trace things down that way. But these are pretty recent results — I only first saw this work just before Christmas of last year, it came together really quickly, and an RSS paper came out of it. He's now going off to intern at Amazon in Germany, so this grasping work will carry on there; it's very interesting work.

[Audience question about construction.] I have had so many conversations with people in the world of construction, and they have lots of reasons why it can never be robots. Construction is a huge sector of the economy of any modern country; we spend a ton on construction, and a huge amount of manual labor goes into it. I saw the statistics somewhere: an iPhone is something like ninety-seven percent assembled by machines and three percent by people, and when you look at a skyscraper it's almost the other way around. A building is a manufactured product, but it has a huge amount of manual input. There is a conference series on construction robotics, and the Japanese had a big push on it in the nineties, but it just never seems to have stuck, and I don't know why. I've heard there are issues around who would fund it: it would have to be the people who manufacture equipment for the construction industry — the people who build cranes and forklifts — who would probably have to invest, because the building companies won't do it. You'd need big construction companies to lean on their equipment suppliers to produce the equipment they need. I think the economic driver is just not there — that's my hunch — or no one has twigged to the fact that there is a big economic incentive. But it's notable by its absence.

[Audience question about yield.] Not in terms of yield yet. We've got statistics on, I guess, energy inputs and herbicide inputs, and the reduction in herbicide is around ninety percent, which is quite significant; I don't know how that translates to yield, because we haven't yet done a whole season. We tackled weeding first because, in the agricultural cycle, you plant once and you harvest once, but you weed many times, so we figured we'd go after that operation — you get the most bang for your buck. The other reason is that, unlike weeding, planting involves mechanically interacting with the ground: you've got to drag a tine through the ground or fire the seeds down into the ground, which is mechanically difficult, and if you have to pull a tine through the ground you need much more tractive force. With weeding, just spraying something or pulling it out is relatively easy, so that's why we went for that one first. People have looked at other techniques too: I showed spraying and mechanical cultivation, but people have also looked at microwaving weeds — take the magnetron out of a domestic microwave oven, hold it over the weed for a second and it will die. It has no moving parts, though there are probably occupational health and safety issues.

[Audience question about the simulation and competition.] Absolutely, there are ways to get involved, and I can go back to that slide if you like — you've got the picture.
Please do — we want everybody who is interested in this topic to have a go and get involved. I've been on sabbatical away from my lab since the beginning of June, so I may be a bit out of touch with what's going on. I'm told November is the date when they want to have this up and running, but that's for the actual competition part; there is material available in advance, so do check it out.

[Audience question.] Some of those things are, at the moment, just inside the lab, but if you have industry contacts, get in touch with the person who does most of the simulation work — a guy called John; his name appears on those slides, so contact him. With the simulation, I'm pretty sure we'd offer it as a service rather than hand over the simulation itself, because you could then reverse-engineer it. So we make it a service: we give you the image, but we're not going to tell you what the lighting was, and that way we can make it hard and push the field forward.

[Audience question.] Yes — that was a strong motivation for going to the swarm of small machines: soil compaction.

[Audience question about whether the field should move toward simulation rather than physical robots.] I confess to being a bit conflicted. I think it's really important to have physical machines to test your ideas on, because some things need to be put to the torch, and that's important. But with physical robots I think we lose the ability to do really strong, rigorous experimentation, and I can't see a way around that: physical experiments cost a ton and are affected by so many things, so simulation is going to win there. If you do everything in simulation I think you can potentially be deluded by artefacts of the simulator, but if you do some sanity tests with a real robot in the real world, then for apples-with-apples comparison of algorithms over a range of environmental conditions, simulation is by far the best way to do it. And it can only get cheaper, which is the wonderful thing about computation: physical robot hardware does not get much cheaper over time, but simulation does. I also think that, as more and more approaches end up being data driven, it's going to be too time-consuming to physically capture the data needed to train these networks, and simulation is a way to generate that data. There is then the problem of how you move an algorithm from knowing what the simulated world looks like to the physical world, and that's where domain transfer becomes important. It's a big area of research, there are a lot of results out there now, and we're beginning to understand how to do it.
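Domain transfer covers many techniques; one common and simple one is domain randomization, sketched generically here under assumed interfaces: randomize the nuisance parameters of the simulator (lighting, textures, camera pose, noise) for every training image so the learned model cannot latch onto any single simulated appearance. The `render_scene` and `label_scene` callables are hypothetical placeholders for a renderer and a ground-truth labeller, not any particular simulator's API.

```python
# Generic sketch of domain randomization for generating simulated training data.
import random

def random_render_params():
    """Sample nuisance parameters to vary per rendered training image."""
    return {
        "light_azimuth_deg": random.uniform(0, 360),
        "light_intensity": random.uniform(0.3, 1.5),
        "ground_texture": random.choice(["soil_a", "soil_b", "stubble"]),
        "camera_height_m": random.uniform(0.8, 1.4),
        "noise_sigma": random.uniform(0.0, 0.02),
    }

def generate_training_set(render_scene, label_scene, n=10000):
    """Yield (image, label) pairs rendered under randomized conditions."""
    for _ in range(n):
        params = random_render_params()
        yield render_scene(**params), label_scene(**params)
```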
[Audience question about underwater imaging.] We haven't done the refraction work for a camera looking across the air–water boundary; we've not done that. The postdoc who drove a lot of this work, Donald Dansereau, used a light field camera for underwater imagery in his PhD. Underwater you often have a lot of particulate matter — a lot of floaters — and using a light field technique you can basically make the floaters go away; that was the denoising he used it for. He can also do some pretty funky stuff, which I still don't fully understand, where light field cameras can be used to effectively increase the gain of the camera system, so the approach has advantages in low-light situations as well as being able to remove particulates. So check out the work of Donald Dansereau. He was my postdoc for a while, but sadly he left. He's done a lot of work on light field cameras, and I wouldn't have done anything with light field cameras if he hadn't come to my lab — one of those serendipitous things, which is pretty cool.

[Audience question about why rely on cameras.] Yeah, that's a good question, and there are probably a few answers. One is that there's an existence proof: we know that perceiving the world in these particular bits of the spectrum enables us to do all sorts of things, so if you want something that can do everything a human can do, we know that this kind of sensing is enough. That's the motivation for RGB cameras, really. They're pervasive, they're commodities, they're low cost. Part of me thinks that because we know it's possible, we should just keep trying to do it with cameras alone, even though that's probably the stupid and hard path to take. You could point an RGB-D camera at the scene; stereo cameras are getting better, and depth cameras work nicely where there's not too much ambient light. But it's partly a matter of principle: we know it's possible, so let's try and do it.

[Audience question.] Yes, there is the issue that was raised earlier about soil compaction. As machines have got bigger and heavier over time in order to become more productive, you start to damage the soil structure, so on a farm you end up with, in effect, fixed roadways that the machines have to drive up and down, and in those strips you lose all productive capacity — you're losing maybe one percent of your land. Maybe that's a cost we're willing to bear, but the bigger question is always: why have agricultural machines got so big and powerful? Partly because we can, but also because it makes the driver more productive — in a big machine you can go faster and do more. As roboticists we should always ask: why is there a driver at all?

[Audience question about what such machines should look like.] Yeah, I think we were just interested in the fact that the equipment has evolved in a particular direction, and we think it evolved that way because of the human driver. So can you go back and say: OK, we took a wrong path; if we don't need the driver, what do we do now, and what does the design space look like? One answer is to have lots of small machines that tread lightly. I don't know whether that's a viable answer — technically it's possible, but whether we can build a business around it, I don't know.