[inaudible] I love that book. It's one of the greats.

So this talk is about video summarization, done by my two students mentioned here. Just to show the problem, here's a short clip from Elementary, which I guess shows why we're here. [Clip plays] "The good old days." OK, if you haven't got the soundtrack: "A couple of thousand. I'm going to need teams working in shifts. It's going to take days." This is basically how analysis of surveillance video is treated these days. I don't know what his assistant meant when he said "Not necessarily", but what we're trying to address in this talk is a little step toward addressing this problem.

In normal cases where there was a terrorist attack (these are two examples of cases seen in Europe) it took weeks, at least a couple of weeks, until the video that was collected from the site was analyzed. People went over the video and found pictures of the terrorists here and there. A similar and more recent case was in Dubai in 2010. Again it took a couple of weeks until the people who analyzed the video found the terrorists in it. But all this time is very long: when it takes weeks to find out what's going on, the people who were involved are long gone.

Something really different happened in Boston. In Boston it took the F.B.I. a couple of days to find the terrorists in the video. Why was it so short? If you look at some of the local newspapers and others, there were rumors: "Israeli technology may have helped identify the Boston bombers." Well, the Israeli technology mentioned here may be the one I'm going to talk about now, but we have no idea if it was really used in Boston. The F.B.I. is a client and has this, so they can use it, but the F.B.I. does not disclose what technologies it uses in its investigations, and therefore I don't know whether they did or did not use it in Boston. But the fact is that the Boston case was the first time that videos
were analyzed in a couple of days rather than weeks to find what happened in the scene. Actually this was recently deployed at the Statue of Liberty, the Empire State Building, and places in China. So there are some notable places that use this technology.

Now let's go back to the problem. The problem is that we have millions of cameras, all capturing data twenty-four hours a day, every day. More cameras are installed all the time, each camera has more resolution, and there are not enough people to watch all of that. And given that a person can watch such video for about twenty minutes before losing attention or falling asleep, the analysis is even more difficult. The result is that any recorded video is effectively lost unless there is some major crime or major terrorist case, where dozens of detectives actually work in shifts to find what happened in the video.

The solution that we try in computer vision is to do this task automatically. Try to detect and track the objects; this is background subtraction. Try to classify the objects: people, cars. Maybe recognize individual people, or individual cars by the license plate. Recognize the activity that the objects are doing, for example a street fight. The trouble in this case is that a lot of progress has been made, but a lot more progress remains.

Now, even if all this automatic analysis becomes very successful, we will still need manual inspection. Manual inspection means that people will look at the video. The first reason why we need people is that automatic analysis is not perfect. But the more important reason is that a picture is worth a thousand words and a video is worth a million words, and this will not change even if automatic analysis were perfect. Activities cannot be described very accurately. For example, there are not even enough words in English to describe colors.
There are orders of magnitude more colors that we can distinguish than words to describe colors, not to mention shapes, motion, activities, etc. So whatever the computer analysis will be able to analyze, language cannot describe all the variations and nuances of what we see. People will always need to see what happens. But again, as we mentioned, there is too much to see in manual inspection. Even after we filter, say we only want to see police cars that went on this road driving more than 40 mph, there may be hundreds of such cases in a day or a week, and no one has the time to watch all of them.

So one simple approach is key-frame based. We have a very long video, and out of the video we take a subset of frames. This subset can be a single frame every so often, or groups of frames, or smaller video clips to show motion, but it is based on frames: a frame is the main entity that cannot be broken. The method that we are presenting is object-based. We don't care about frames, we care about objects. We detect and track objects and then we show the objects, and once we show objects we can play with time, as you will see in the talk. Especially, we would like to play objects simultaneously even if the objects occurred at different times.

So in the basic video synopsis we take all the stages of video analysis and we assume that only the first stage is done. This is object detection and tracking, and this is not a difficult problem when the camera is stationary. When the camera is stationary, as most surveillance cameras are, anything that moves is an interesting object, so finding something that moves in a static camera is the easiest task there can be. We assume that this is done, and we present the objects in a brief time so that the people who watch them will do the understanding. So let me show the results.
So the result of video synopsis is the following. On the left is the original video, a video that is basically one day long. Of course you don't want to watch the entire video. On the right is the synopsis, summarized in a minute. All the objects that appeared during the day are shown together in a brief time. You see that it's very, very short but very crowded, as all the objects are shown together. When I gave this talk in Bangalore, people said, "But this is how Bangalore airport looks anyway." Actually, when I took off from Bangalore I looked out of the window and they were right. This is how Bangalore airport looks anyway: it's the video on the right.

So overall, what we present is a fast way to browse the video and also to index it. Once an object is of interest, we can actually point to this object. Every object has a time in the original video, so we can go to the original time and show it. And the idea, again, is that objects that appeared in the original at different times are shown simultaneously. How can this be used? It can be used in several ways. One is: I have a budget of three minutes before going to lunch, and I want to see everything that can be shown in three minutes, the most important objects you can fit in three minutes. Another is: show me all the objects in as short a period as possible; I can postpone my lunch a couple of minutes if necessary. And basically this is only a presentation layer.

Before I start to talk about this, I'll talk about previous work we did on time, because showing objects from different times simultaneously is in a way screwing up time. Time is not chronological anymore, and the idea of not keeping the chronological property of time appeared in an earlier work that we did on mosaicing. So let's look at mosaicing. This is a video, and we'd like to mosaic this video, and this is a normal mosaic that we make from a video. What's missing in this mosaic? It's the wind. The wind is missing.
It's blowing: can you see the wind blowing in the branches and the leaves in the first video? We don't see it in the second video. It would be nice if we could make this video on the right, where we see both a mosaic and the wind. People have been doing mosaics for many years and no one did this mosaic on the right, and the reason is simple: in order to do the mosaic on the right, something wrong has to happen. What's wrong? If we look at the shadow on the ground, in the original we see the shadow on the ground several seconds before we see the branches that cast this shadow. We see the shadow, and several seconds later we see the branches. On the right, we see the branches and the shadow moving simultaneously. So in order to do this we must screw up time; time cannot be kept chronological anymore. This is basically a kind of mental block that we had to overcome in order to do this mosaicing, and once you break it, it is easy to take it to other places.

So how is it done? Assume this is an original video, a video with a camera panning from left to right. These are the Iguazu falls between Brazil and Argentina, said to be the largest around; this is actually a very tiny portion of these really huge falls. If we take this video on the top and look at it in space-time, where this is x and this is time, we see one frame, another frame, and all the frames moving from left to right. So this is the space-time volume, and the mosaic you saw on the bottom is basically made by taking the right part of each of the frames and stitching them together. Now we say, let's make another mosaic: not take the right part, but slightly left of the right; then take the center; then take the left part. So basically we have a set of mosaics made from the same video, each time picking a different part of the frame. If this blue line represents the part of the original frame that we took
for the mosaic, we have a sequence of mosaics, each taking a different part, until the last mosaic takes the left part. Now what's the difference between taking the right part and the left part? The right part is just when a piece enters the field of view, and the left part is when the piece goes out of it. So the first mosaic shows an object when it entered, and the sequence continues until the object goes out. If we play the mosaics one after another, we get this video. In this video we get a kind of panorama: we see this water flowing, and this water on the right flowing, and we see them flowing together even though originally they never happened together. The reason is that in this image we see objects from this time, from an earlier time, and from the beginning of time; we see all the different times and we show them together. So this is how we can manipulate the image by showing together activities that happened at different times.

So let's look again at the video, and say we want to change the camera from panning left to right to panning right to left. What would be the natural way? Play it backwards. So here we play it backwards: the camera pans from right to left, but gravity works in reverse. The right thing is here: the camera pans from right to left and the water flows down. So you see that this is a non-trivial manipulation of time, where the water drops have the right time but the camera motion has a reversed time.

OK, so how do we generate the mosaic? We have a set of frames. We run a manifold through the frames, and all the pixels that the manifold touches are collected into one image. When we want to see a movie, we move the manifold, and each position of the manifold generates one frame.
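As a rough illustration of running a manifold through space-time, here is a minimal Python sketch. It is not the implementation from the talk: the function name `render_time_front`, the per-column integer `offsets`, and the nearest-frame sampling (instead of the interpolation mentioned in the talk) are all assumptions made for illustration, and the camera is taken to be static.

```python
import numpy as np

def render_time_front(volume, offsets, t):
    """Build one output frame by slicing a space-time volume.

    volume  : array of shape (T, H, W), a grayscale video (hypothetical input)
    offsets : integer array of shape (W,), time offset of each image column
    t       : base time of the output frame

    Column x of the output is copied from frame t + offsets[x],
    clipped to the valid frame range (nearest frame, no interpolation).
    """
    T, H, W = volume.shape
    times = np.clip(t + offsets, 0, T - 1)   # source frame index per column
    cols = np.arange(W)
    # volume[times, :, cols] has shape (W, H); transpose back to (H, W)
    return volume[times, :, cols].T

# Usage: a front whose center columns come from the future relative to the sides.
T, H, W = 60, 4, 9
video = np.random.rand(T, H, W)
center_ahead = (20 * np.exp(-((np.arange(W) - W // 2) ** 2) / 8.0)).astype(int)
movie = [render_time_front(video, center_ahead, t) for t in range(T)]
```

Moving the base time `t` while keeping `offsets` fixed corresponds to moving a static manifold through the volume; all-zero offsets reproduce the original frames.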
This position gives one frame, this one another frame, and this one yet another; every frame is a collection of pixels from the space-time volume. For example, this manifold basically picks one original frame. If we choose another manifold, the manifold collects from a larger number of frames, and each one of the manifolds gives a different picture. For example, this manifold is just the collection of the pixels that it touches. When I say a manifold touches a pixel, it's an approximation, because the manifold goes in between pixels, so there's a lot of interpolation going on here. It's not actually touching, but I will ignore all the interpolation that goes on; it's not that difficult to work out what the interpolation is.

So one example: I don't know if you saw the demolition of the tower in Frankfurt a couple of days ago. There was a tower of thirty-two floors in Frankfurt that was demolished two days ago. The reason I mention it is that there is one company, probably the same company, that demolishes huge structures in the universe, and they have a website on which they put pictures of their creations. So if you like to see destruction, this is the place to go; it's really things in slow motion. This is taken from the website. This is the Kingdome in Seattle, and here is the demolition. It all fell down uniformly. If you wanted the dome to fall first and the walls later, or maybe the walls first and the dome later, you cannot take it again because the structure is gone. But we can do this with our stuff. We can take the original video; this is looking from the top at the space-time volume, this is the x
direction, this is y, and we take this manifold. What does this manifold do? The center is from the future compared to the sides, so the sides are at an earlier time and the center is at a future time. Basically everything that we see in the center is from the future. So when we run it through the space-time volume, we see that the dome falls first and the sides come down later. Or we can do the opposite: we can take the center from the past and the sides from the future, and when we run this we see that the sides fall first and the dome falls later. So this kind of manipulation can be done with this.

But the manifold doesn't have to be static; it can be dynamic. Let's look at how we can play with a kind of sports event. Look at this swimming competition. It's not clear who wins; they are all about the same, but we want to declare someone the winner. So we take this guy and we have him swim faster. We start with the original frame, and he starts to go faster: as we go through time, we gradually take him from the future, and he is winning the competition. Now if this is a photographer at a high school, he's going to sell the video only to the mother of the winner, right? No one else will buy. If he wants to sell more than one video, he says, well, let's make another guy the winner, and this one he can sell to his mother. So he goes and creates another winner from the same video. Actually, we worked with Panasonic, who at one of the recent Olympics provided the video recording. We tried to offer them to broadcast to every country that their athlete is the winner, but they refused: no one would believe video after that. So they didn't take that.

OK, now we go back to surveillance. There are tons of cameras. Cameras used to look ugly; now they look much less obvious, much smaller. Can anyone tell me how many cameras are here?
No cameras? There's one camera; we're being recorded. It's in here, but it's not fixed to the wall. What I'll describe here is basically for cameras that are screwed to a wall. The story about this work is that we tried to summarize home videos, and eventually we declared failure because we couldn't do home videos; we could do only stationary cameras. So we were telling people around us about our failure, and someone said, "What do you mean it's a failure? Most of the video on Earth is from stationary cameras." I said, "I don't know, I never take video from a stationary camera." He said, "But look at all the cameras around you taking your picture." And he was right: there is much more video taken from cameras like this. Can anyone guess how much of the video on Earth is taken by such cameras? Something like ninety-five percent of all video is taken by stationary cameras like this. Actually it's even about eighty percent of the disk space in the universe. Not only are there a lot of videos, but they take all that disk space, because think about how many bits such a video generates every second; many are HD cameras. We once looked at a company that makes storage, E.M.C., and they said that most of the storage on their devices is bits of surveillance video.

OK, so more and more cameras are used, and again there are not enough people to watch the video. People say that professional guards are watching one percent. I don't believe them. I don't believe that even one tenth of one percent is being watched. Almost everything is recorded and no one is watching it.

So what is being done in the field of video summary? Taking every n-th frame, or key frames, or a collection of short sequences, or adaptive fast forward.
Some thought to do more, to capture the motion in the video. And there is very interesting work done at Microsoft, basically very similar to ours and published at the same conference, but they also allowed space to change. The trouble with allowing space to change is that no one could watch the video; we need some stability. So if we screw up time, at least we shouldn't screw up space, so we keep space intact. What we do is object-based, and we manipulate the time of objects. We don't do fast forward of objects; we could do fast forward or whatever, but this is not part of the method. And because we change the time of objects, causality is not kept. We should remember, when we see a summarization, that objects don't appear in the order in which they appeared in life. So there's no causality; it's a loss. That's what we lose when we build this summarization.

And again, I've shown you this before. How do we do this? Basically we have the space-time volume of a very long video. We have objects that appear at different times, but in this case they also appear in different places: a person is walking on the ground, a bird is flying in the sky. Why not show a summarization where they're both shown simultaneously? Since they each occupy a different part of the picture, they will not overlap. So when we show the synopsis we'll show both the bird and the person together, and we'll show a shorter video. The idea is to take surveillance video, detect and track the objects, and put all the objects in a database. Then when we need to create a summary, we just collect all the objects and show them simultaneously, regardless of the original order. For example, this is nine hours in the original video, now shown in thirty seconds.

[In reply to a question] We're trying to minimize collisions. The minimization has many elements: one element is the total length of the summary, another is
how many collisions you have, and you can play with the weights. The more weight you put on collisions, the sparser the video will be. Yes, collisions in the frame; there is no 3D reasoning at all.

OK, so what are the steps? One is to detect and track objects and store them in a database. Then select the relevant objects from the database. Selection can be by time, show me anything from nine o'clock to twelve o'clock, or show me all the red people, etc. Then we display them in a short video synopsis where objects from different times appear simultaneously. Then metadata like time, color and motion can be used to sort. We can say, sort the objects from fast to slow, and we can play a lot with this. It smells like a Google search: basically you want some features. I want red people, so the reddest people come first, and the less red you have, the later you come in the search, and when you find the person you want, you stop. And of course there's always a way to index: when something is interesting, there is the original time, and we can play the original video.

So detection and tracking is done by something called background subtraction, which is very, very common. Every pixel is examined over time, and for every pixel, statistics of the color of the pixel are kept, like a histogram, a mixture of Gaussians, etc. Once we have a statistical model of a pixel, the minute a color arrives at this pixel that has a very low probability given the probabilistic model, we say, OK, this might be foreground. So every pixel has a probability of being foreground. Then comes some kind of cleaning. We want foreground objects to be in groups, because objects are more than one pixel, and we want them to be consistent in time and not flicker. That takes some cleaning and grouping.
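The per-pixel statistical model can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a single running Gaussian per pixel on grayscale frames (the talk mentions histograms and mixtures of Gaussians), with hypothetical parameter names `alpha`, `k` and `min_std`, and without the spatial and temporal cleaning step.

```python
import numpy as np

class RunningGaussianBackground:
    """Single-Gaussian-per-pixel background model (illustrative only)."""

    def __init__(self, first_frame, alpha=0.05, k=2.5, min_std=2.0):
        self.mean = first_frame.astype(np.float64)
        self.min_var = min_std ** 2
        self.var = np.full(first_frame.shape, self.min_var, dtype=np.float64)
        self.alpha = alpha   # learning rate of the running statistics
        self.k = k           # foreground threshold, in standard deviations

    def apply(self, frame):
        """Return a boolean foreground mask and update the model."""
        frame = frame.astype(np.float64)
        diff = frame - self.mean
        # A value far from the pixel's mean has low probability under the model.
        foreground = diff ** 2 > (self.k ** 2) * self.var
        # Update the statistics only where the pixel looks like background.
        bg = ~foreground
        self.mean[bg] += self.alpha * diff[bg]
        self.var[bg] = np.maximum(
            (1 - self.alpha) * self.var[bg] + self.alpha * diff[bg] ** 2,
            self.min_var,
        )
        return foreground
```

In practice one would likely use an existing mixture-of-Gaussians implementation and follow it with morphological cleaning and connected components, so that foreground pixels form groups that do not flicker.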
After that we get all the moving objects in a frame, and we also do tracking in time. There is actually a nice library of background subtraction methods on Google Code, and some of them give nice results. Then we also use tracking. We detected the foreground objects in one frame, another frame, the next frame, and we can track them. For example, if we want a group of pixels in different frames to be one object, they had better be similar to each other. This is especially important when objects cross each other: we want to make sure that when an object enters an occlusion, we continue with the same object and don't switch to a different object. So we use shape similarity for tracking, and we use a Kalman filter to make sure the motion is similar. Nothing special here.

We end up with what we call tubes. Tubes are objects along time, and this is what they look like. On the top is a video; the illuminated part is the part that belongs to an object in each frame. If you compress the time into one image, all the instances of an object at different times are shown here. So basically an abstract object is something that lives in space-time.

Then comes the packing. We have the set of tubes in space-time; these are again viewed from above. We want to pack them for two purposes: we want a shorter video, and we want to minimize the overlap among the tubes, since we don't want collisions among objects in the packed video. So basically there is a cost function. The parameters of the cost function are the time shifts, one for every object.
Each shift says at what time the object will be played in the resulting video. This term is there to show as much activity as possible in as short a time as possible. Here we have the collision cost: we want the objects not to collide with each other, but it doesn't have to be strict, because sometimes an object is there for a long time and we would like to allow collisions. And this term is temporal consistency: if it is important for us that objects that appeared earlier also appear earlier in the summary, then we can put a larger weight on it. So there are weights that we can play with to get different summarizations.

As for the minimization of this function, originally we did it using simulated annealing; it's a very difficult problem. But when this was turned over to a company, the company said, well, simulated annealing doesn't work, it takes too long, and people won't wait for the summary. So they use some kind of greedy algorithm that works almost as well, and the videos are a little longer. For a given amount of collision that we tolerate, if we do simulated annealing we can get a shorter video with the same amount of collision compared to the greedy algorithm, but of course it takes longer.

OK, so let's give an example. This is a video that the students made for video summarization. There is a student coming from the left, then someone comes from the right, and this girl is there for a long time, coming and going. In the original, someone comes from the left and disappears, someone comes from the right and disappears, and then for a long period she comes in and leaves. Basically, to make the summary we shift this person in time, and then this one as well, and the summary is shown here: while she is taking her water, everything else happens. So basically we show everything in the wrong order, but in a much shorter time.
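A greedy choice of time shifts can be sketched as follows. This is not the company's algorithm, which the talk does not describe; it is a minimal Python illustration assuming each tube is a list of per-frame bounding boxes `(x0, y0, x1, y1)`, that collision is measured as total box-overlap area, and that tubes are given in order of their original start time. The names `greedy_pack` and `max_collision` are hypothetical.

```python
def box_overlap(a, b):
    """Overlap area of two boxes given as (x0, y0, x1, y1)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def collision_cost(tube_a, shift_a, tube_b, shift_b):
    """Sum of box overlaps over the synopsis frames where both tubes are visible."""
    start = max(shift_a, shift_b)
    end = min(shift_a + len(tube_a), shift_b + len(tube_b))
    return sum(
        box_overlap(tube_a[t - shift_a], tube_b[t - shift_b])
        for t in range(start, end)
    )

def greedy_pack(tubes, max_collision=0):
    """Assign each tube the earliest shift whose added collision is tolerated.

    Returns one start time (in synopsis frames) per tube, in input order.
    """
    placed = []   # (tube, shift) pairs already in the synopsis
    shifts = []
    for tube in tubes:
        shift = 0
        while sum(collision_cost(tube, shift, other, s)
                  for other, s in placed) > max_collision:
            shift += 1   # terminates: past all placed tubes the cost is zero
        placed.append((tube, shift))
        shifts.append(shift)
    return shifts

# Usage: a bird in the sky and a person on the ground never overlap,
# so both start at time 0; a second person on the same path has to wait.
bird = [(0, 0, 2, 2)] * 3
person = [(5, 5, 8, 8)] * 3
second_person = [(5, 5, 8, 8)] * 2
shifts = greedy_pack([bird, person, second_person])
synopsis_length = max(s + len(t) for s, t in zip(shifts, [bird, person, second_person]))
```

Raising `max_collision` plays the role of putting less weight on the collision term: the synopsis gets shorter and more crowded.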
So what I showed now is that we have objects that occurred at different times appearing simultaneously. We can also take objects that appear for a very long time and break them up into pieces. This is an example. If anyone knows Michael Cohen from Microsoft, this is his daughter, and he can look at this for a long time. For us it's too long, so we want to shorten it. So here she is: in one third of the time we see all she did on the monkey bars.

Again, we can handle a moving camera in a special case: once the frames are aligned into a mosaic, we can consider the camera stationary. So this, for example, is in the mosaic. How many copies are here? It's funny, it's hard for me to count when it's moving; when it's stationary it's easy. It's four or five.

OK, then we have to create a background. We have to place the objects on a background. If this is a short time, if it's five minutes into five seconds, then a single still image can be the background. But what do you do for a full day? For a full day, our background is basically some kind of fast forward of the real background. For example, this is a background image: it is the whole scene without the moving objects. Now how come we see a car there? We see a car there because this guy parked in the morning and left in the evening. We keep statistics of the background, and after some time, a few minutes of parking, the car becomes part of the background. So the background does change: it changes with the parked cars and it changes with the illumination. OK, so this is a movie showing just the background, and we see that it runs faster at night because there are fewer frames at night. So this is the background video on which we place the objects.
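One simple way to get such a background time-lapse is a per-pixel temporal median over blocks of frames. The sketch below is an assumption for illustration, not the method from the talk: the function name `timelapse_background` and the `window` and `step` parameters are hypothetical, and the activity-dependent frame rate (fewer background frames at night) is not modeled.

```python
import numpy as np

def timelapse_background(frames, window=5, step=5):
    """Time-lapse background: per-pixel median over blocks of frames.

    frames : array of shape (T, H, W)
    Returns an array with one background image per block of `step` frames.
    Objects that move through a pixel are voted out by the median, while an
    object that stays put for most of a block (a parked car) becomes background.
    """
    frames = np.asarray(frames)
    backgrounds = [
        np.median(frames[s:s + window], axis=0)
        for s in range(0, len(frames), step)
    ]
    return np.stack(backgrounds)
```

With blocks spanning a few minutes of video, this gives a slowly changing background that follows illumination and parked cars.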
So when we paste the objects, the first thing we find is that when we put an object from one time on a background from another time, we see a boundary, like this box you see around the object, because the object was taken when the background was bright and here the background is brownish. It really disturbs the viewing, so we'd like to blend. We found that Poisson editing does the best. So this is the background, and this is after Poisson blending. You see that each object here comes with its own different background, and the blending puts it right, so it stitches well.

This is one of the videos, and this is the synopsis. Let's look at one frame of the synopsis. We have three objects, and we know that the three objects are from different times. We know this by the shadows: one object has a long shadow, probably from the morning or the evening; another one has no shadow at all, so it was probably cloudy; and another one has a short shadow, probably at noon. So we know that these three objects never saw each other in real life. But when we want to know what happened, we can select an object and it will point to the original video where this object was shown.

There are two use cases that I like to show. One use case is a supermarket. A car here was hit, and the client came to the security officer and said, "Look, you have all these cameras in the parking lot. I want you to find who hit my car." He asked, "How long were you in the shop?" The client said, "About an hour and a half." "So I'm going to watch an hour and a half of video?" Then he realized that just a week earlier they had bought this video summarization, so he went to the video summarization and summarized this hour and a half into a couple of minutes. And here is the summarization playing. Look at this car.
So you see, boom: someone hits the car and then goes and parks somewhere. So he was very happy that after twenty seconds, rather than an hour and a half, we found the car that hit. It had not run out of the parking lot: the car was still parked there, so they waited there for the driver and exchanged details. So it's interesting that it doesn't really have to be terrorists, and it doesn't have to be two weeks; it can be one hour in the parking lot of a supermarket.

Then, if you want to spy on your employees when they go on lunch break, you can take a very short video and see who goes where. You see the collision effect: this guy in green stands there all the time. If we did not allow collisions, we couldn't summarize it at all. So you know they don't talk to each other; he is just waiting there for someone. So this is an example of a summary where we do allow collisions.

OK, we'll just skip this. One of the things is measuring similarity. Similarity can be in shape and in trajectory; it's obvious, so I'd like to skip it.

OK, so what do we have? We have a lot of metadata that we accumulate on an object. Some metadata is obvious: the color, the trajectory, the speed, a set of SIFT features that we have on an object. Other metadata is less obvious; it's something that we need to compute: is it a person or a car? If it's a person, maybe what's the name, is it a man or a woman? Is it a Honda or a Toyota?
So everything goes into what's called metadata, some kind of symbolic information that is attached to the object, and then we can select, based on this metadata, what we would like to see. Basically the best thing is to show the objects closest to the description first, and then the less close ones.

Here is an example of a video made at some university; I don't remember which university this is. In this video mostly students are crossing from one side to the other, and again, if you watch it you can fall asleep very quickly. Assume you want to make a summary. This is the synopsis: we see all the objects together. If you look carefully you can see the problems of object detection and tracking: you can see objects that kind of disappear in the middle. But again, it's very unorganized until we have something in mind. For example, assume we want to find the fastest objects first. OK, so we get the... Nope, sorry. So with the fastest first, we see that the bicycles are the fastest objects. And what's next fastest after bicycles? First the faster bicycles, then the slower bicycles, and next it's running. So we can sort by speed until we find what we want.

We can sort by color. For example, say the reds come first: we see the objects that have a lot of red, and then we see less and less red. You'll see bugs, for example; you'll see some with no red at all. What are they doing there? I don't know. See here, these two girls have no red at all. So, as with whatever Google sorting gives you, sometimes you don't know where a result is coming from.

We can also search by similarity. You select an object and say, show me objects similar to this. And guess what "similar" is: similar is the trajectory, that is the location, the direction and the speed, and also a set of SIFT features. The first object here is the object that we selected, and then we picked up all the girls with shorts going in the same direction.
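Sorting tubes by metadata is straightforward once each tube carries a small record. The Python sketch below is only an illustration: the `TubeMeta` fields (`vx`, `vy`, `red_fraction`) and the velocity-only similarity are assumptions, whereas the talk describes similarity over trajectory location, direction, speed and SIFT features.

```python
import math
from dataclasses import dataclass

@dataclass
class TubeMeta:
    """Hypothetical per-object metadata record."""
    tube_id: str
    vx: float            # mean horizontal velocity, pixels per frame
    vy: float            # mean vertical velocity, pixels per frame
    red_fraction: float  # fraction of object pixels classified as red

    @property
    def speed(self):
        return math.hypot(self.vx, self.vy)

def by_speed(tubes):
    """Fastest objects first."""
    return sorted(tubes, key=lambda t: t.speed, reverse=True)

def by_redness(tubes):
    """Reddest objects first."""
    return sorted(tubes, key=lambda t: t.red_fraction, reverse=True)

def similar_to(query, tubes):
    """Closest in velocity (direction and speed) first."""
    return sorted(tubes, key=lambda t: math.hypot(t.vx - query.vx, t.vy - query.vy))

# Usage with made-up values
bike = TubeMeta("bike", vx=8.0, vy=0.0, red_fraction=0.1)
walker = TubeMeta("walker", vx=1.5, vy=0.0, red_fraction=0.6)
runner = TubeMeta("runner", vx=-4.0, vy=0.0, red_fraction=0.0)
tubes = [walker, runner, bike]
fast_first = [t.tube_id for t in by_speed(tubes)]
```

The sorted order can then be used as the order in which tubes are placed into the synopsis, so the best matches are played first and the viewer stops once the object of interest has been found.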
So maybe skin color was dominant. Of course it's natural that the speed and the direction are the same, but maybe also the skin color. If we play it a little longer we'll see men, we'll see other people. OK, so I'll end here, and thank you.

[Question] So the question is whether one can analyze a manipulated video, where we accelerated one swimmer relative to the others, and find out that this was done. Most of the time I think the answer is no, especially for the swimmer. How do you do it? Basically you do it with two things: you do graph cuts to cut in the places that you want to fill, and then after you do the graph cuts you do some kind of blending. So there's no way you can detect this. Now, if the cut has to go through objects, for example: at first we did the cut without graph cuts, so we were cutting in the middle of the hand of a person. It means that you take this part of the hand from one time and the other part from another time, and then the hand will have a funny shape, because it's dynamic. But when you do it with graph cuts, it's very hard to find anything. Actually, one of the best reviews I got was when people who do forensic analysis of videos came to us and said, "Your videos are the most difficult for us to analyze. When we want to see if a video was manipulated, yours are the most difficult to analyze. Please provide us with samples." So I took that as a compliment.

[Question] Yes, yes, shadows are a big problem. And of course the big thing is to say, well, let's ignore shadows. Right now we don't ignore shadows, because we have done it only with motion detection: object detection was based only on the fact that the object is moving. Of course the most important object is a person.
So we'll try to do person recognition, in which case we could ignore the shadow. Right now we don't ignore the shadow, and you saw in the example I showed that we see three objects at the same time with different shadows, and then you know that the three objects are from different times. Even if we show the objects without the shadows, you know objects have shadows, right? How can we have an object without one? So we just tell them that they should remember that the time is not correct.
Now, I must tell you that I never look at these videos, right? So I don't know what the reaction is of people who watch a video. Actually, one of the things that we don't know is whether it is helpful or not. Maybe putting many objects together just confuses the analyst, and it would be better to see one object after another rather than all of them together. We don't know the answer. But here is what the F.B.I. did. The F.B.I. was very interested in this from the beginning, so they took the software and played with it for a couple of months. They didn't say what the result was, but they bought it. The fact that they bought it afterwards means that they found that it may be faster to analyze the video after the summarization than before. But they don't tell me. They only get information; they never volunteer to say anything, and even if you ask them they refuse to say.
Basically, when there is no parallax there is no issue; the issue is parallax. When you have parallax, you cannot tell between a moving object and the parallax motion. Now, if you have object recognition, you say, OK, I want people, so humans, and you will find the people even if there is parallax; the parallax will not confuse you.
But then the question is on which copy of the background to put it. The parallax will change the background, so which background do I show, on which copy of the background will we put the object? So it's not that obvious. We are now working on wearable video. By wearable video I mean that the students wear video cameras and walk around, and then the question is whether we can summarize this using this method when you walk around. Maybe when you sit in the office and students are coming in and out, in and out, you can summarize this, but any other, more interesting activity is more difficult to summarize. Actually, if you can do 3-D positioning, then you can plant the object in the right place, and if you have a really good model of the room in 3-D you can put the object on a 3-D model. But this is much too difficult, I believe, for something that has to be done quickly; summarization needs to be done fast.