[00:00:05] >> For our last seminar today we're delighted to have Jay McClelland joining us. Jay started his training at the University of Pennsylvania and began his faculty career at the University of California, San Diego, then moved to Carnegie Mellon, where he was one of the founders and co-director of the Center for the Neural Basis of Cognition, a pioneering partnership between two different universities that I know many people here are familiar with. He is now at Stanford, based primarily in the Department of Psychology, where he is the Lucie Stern Professor in the Social Sciences. He is a luminary in the field, one of the early pioneers of connectionist models explaining cognitive phenomena with neural networks; along with his colleagues he really pushed the development of that area, and you can draw a straight line between that work and the resurgence of neural network models and the excitement about them in machine learning today. He is certainly recognized broadly for the contributions of this work: he is a member of the National Academy of Sciences, and his work has been cited almost 100,000 times. [00:01:06] His list of awards is as long as my arm; the one I decided to highlight this morning is the David E. Rumelhart Prize, for his contributions to the theoretical foundations of cognitive science. I think this talk hits an especially important intersection, given the role of neuroscience and computing in machine learning, here at a technology-driven campus like Georgia Tech. So, very pleased to have him; thank you. [00:01:34] >> Yeah. So I've been trying to integrate mind, brain, and computation for a long time, I guess. I definitely want to acknowledge the importance of the contributions of people like Dave Rumelhart and Geoff Hinton, but also of the experimental psychologists whose work was very influential in getting me going, of neuroscientists like Bruce McNaughton, with whom I have collaborated, and now of a theoretical physicist named Surya Ganguli, who does amazingly insightful mathematical analyses of actual neural circuits and of artificial ones as well; his contributions will be featured in this talk today. [00:02:19] I want to start with some very basic premises that I work with, and this is, I guess, my most fundamental premise: your knowledge is in your connections. For me, an experience produces a pattern of activation over neurons in one or more brain regions. The pattern we have as we look out at the world, or as we're reading a sentence and understanding what it means, or in any other situation, is a state, a pattern of activity. But what allows us to have those states, and to imagine and to move from one state to another, is the knowledge that's in our connections. The connections allow sensory information to produce representations and an understanding of, let's say, a chess position, or an anticipation of what might happen next, and all of that is learned, acquired knowledge that's in your connections. [00:03:22] This lecture is largely about memory, so we need an initial agreement about what we think a memory is. For me, the trace that's left in our nervous system after an experience is just a set of adjustments made to connections.
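To make the premise concrete, here is a minimal sketch, not taken from the talk's slides, of "knowledge in the connections" in a generic Hopfield-style network: two patterns are stored as Hebbian weight adjustments, and recall is pattern completion from a partial cue. All names and sizes here are illustrative assumptions.

```python
# A minimal sketch (illustrative, not a model from the talk): Hebbian
# weights store two patterns; recall is pattern completion from a cue.
import numpy as np

rng = np.random.default_rng(0)
patterns = np.sign(rng.standard_normal((2, 40)))   # e.g. face+name patterns
W = sum(np.outer(p, p) for p in patterns) / 40.0   # Hebbian outer products
np.fill_diagonal(W, 0.0)                           # no self-connections

cue = patterns[0].copy()
cue[20:] = 0.0                      # present only half the pattern (the "face")
for _ in range(5):                  # settle: the connections fill in the rest
    cue = np.sign(W @ cue)
print(np.mean(cue == patterns[0]))  # ~1.0: the full pattern is reinstated
```

Note that both stored patterns live in the same weight matrix: the traces are superimposed, which is exactly the point made next.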
[00:03:45] Those adjustments happen when we have the experience, or as a result of the experience. So I think of each such experience as leaving a trace, but this little neural network here could have many, many different experiences, and each of those experiences is going to result in adjustments to essentially all of these connection weights. So there's no separate repository for each individual experience; they're all superimposed in the connection weights. Memory traces are not separable or distinct. [00:04:20] So if that's true, recall isn't a matter of pulling something out of a drawer, like a book from a library, as people used to think about it. Recall is a matter of having a pattern evoke another pattern, of producing a pattern completion effect. So we can think of it this way: maybe half of these neurons are coming from the fusiform face area and half from a phonological area of the brain, so this set of neurons represents our visual impression of somebody's face and this set represents the sound of their name, and what happens when we retrieve the name from the face is that somehow or other we use the connections that were established from these neurons onto those neurons [00:05:15] when we experienced the association between the name and the face, and now we use them again to reconstruct the pattern that corresponds to the person's name. The key thing is that this reinstatement depends on the knowledge that's in the connection weights, and in general this knowledge is going to [00:05:43] reflect the influences of many different experiences. For me there's no such thing as a veridical memory: recall is always a construction that draws on the many past experiences that adjusted the states of our connections prior to the new experience, which then made further adjustments on top of those. When we attempt to recall or retrieve, all of the knowledge that's in the connections, including what came from that specific association, plays a role in the process. So we would never think of memory as a veridical readout; it's always a constructive act. [00:06:30] OK, so that's the fundamental starting place: your knowledge is in your connections. But there's a second very important starting point for this particular talk, which is that our brains have evolved to have two quite different learning systems, both of which, according to me, use connections, but with slightly different, or maybe very different, parameterizations. We really learned something about this from the famous neuropsychological patient H.M. and from other studies of the effects of brain lesions on [00:07:17] memory and learning. This slide includes a figure that was published very early on, right after H.M.'s lesions. H.M. was a neuropsychological patient with an intractable form of epilepsy. In prior surgeries, the neurosurgeon William Scoville had excised one of the medial temporal lobes in other patients, which had had an effect on their epilepsy without much effect on their memory. H.M.'s epilepsy was so intractable they said, well, we'd better take out both this time, so in H.M.'s case they took out both sides. [00:08:06] It did relieve his epilepsy, but it produced a profound memory impairment, and it's interesting what was left and what was not. So: H.M.
exhibited intact performance on tests of intelligence, general knowledge, language, and other acquired skills. He could tell you the meanings of words; he could read irregular words aloud just as well as anybody; he had knowledge of the history of the world and of his own early life, [00:08:44] and in general he could carry on a reasonably intelligent conversation with you. The only thing was, he couldn't remember that he'd done so. If you walked out of the room and came back three minutes later, he'd look at you like he'd never seen you before, and he would start telling you the same stories again. So a lot of the knowledge in his connections was still there, but his ability to form new memories was very profoundly impaired. He couldn't explicitly remember previous episodes or events, and he couldn't learn to associate one arbitrary word with another: somebody tried to teach him a list of five arbitrary word pairs (take this item, let's say 'locomotive', and remember the word 'dishtowel' with it), and he couldn't learn even five pairs after a hundred repetitions. [00:09:40] He couldn't even remember from one day to the next that he'd done the task, and new factual information did not stick; he didn't remember new facts that you told him. But you could present an item, and later present a degraded version of that item, and he would recognize it [00:10:06] more readily than he would have if you hadn't presented it before. You could use, let's say, line drawings of various kinds of objects: he sees a line drawing, and later you present fragments of a drawing he's seen before, and he recognizes it better than drawings from another set you didn't show him. So something is happening in his brain; something is being tweaked to leave some residue of that prior experience, but he can't tell us about it, except that he can recognize the picture more readily from fewer segments. [00:10:44] He was also able to learn new skills, such as drawing while looking at his hand in a mirror instead of looking at it directly. This is a hard skill to acquire; we learn it gradually over repeated practice, because the line tends to go the wrong way when we see the hand moving in one direction while trying to draw in another, but we gradually get better at it. He got better at a normal rate; he just didn't know he'd ever done [00:11:12] that task, or other tasks like it. So some kinds of learning are spared while others are not, and lots of knowledge is spared. He also showed a retrograde amnesia, that is to say, he lost his memory for events that occurred within a period of time prior to the lesion. So how do we think about this? [00:11:37] Together with Bruce McNaughton and Randy O'Reilly, originally back in 1995, we formulated the complementary learning systems theory (I was told I should point over here), and we recently reprised it in a paper in Trends in Cognitive Sciences. But the key ideas are very simple.
[00:12:03] The starting place is the idea that connections within and among neocortical areas, which I've shown here in green, support gradual acquisition of structured knowledge through what I like to call interleaved learning. What that means is that as we go through our daily life we repeatedly encounter many different things: each morning we may have a bowl of cereal, at lunch a sandwich, at dinner some prepared meat with a couple of vegetables, and this happens again the next day. So we learn about breakfast, lunch, and dinner in an interleaved fashion, gradually, over extended time; we talk about different subjects at different times of day, we go from one class to another, and we gradually build up our knowledge through this interleaved exposure to all the different aspects of experience. Our neocortical networks learn this through a very slow learning process that takes place over what I like to call developmental time. [00:13:09] By the time we reach middle childhood a lot of that has already occurred, but it continues through school and on into adulthood; these connections are still gradually changing as we have continued exposure to a multitude of different kinds of experiences. [00:13:34] But suppose we're going to learn something new. We meet a new person we've never seen before, and somebody tells us that person's name in some situation. The situation is represented in one part of our brain, the person's face in another part, the sound of the person's name in yet another part, and in order to learn the association [00:13:58] we would make tiny adjustments to the strengths of these cortical connections based on this one exposure; according to me, we might even gradually learn to recognize the person over repeated exposures. But cortical systems learn slowly, so if we want to retain any trace of this after a single trial, we need the rapid learning system based in the medial temporal lobe, the part of the brain that was removed in H.M. It essentially takes the various elements of the cortical experience, which are projected into it, and rapidly binds them together by making big adjustments to connections. [00:14:42] Then, when the person's face is presented to us again, it provides a probe back into the medial temporal lobe system, and a pattern completion process occurs, like the one I was describing before, to bring back to mind the rest of the episode in which you saw the person, and perhaps also to help you retrieve their name. Now obviously this isn't a perfect one-shot learning system; we often need more than one exposure to learn something new. But with a few exposures to a new person, or after talking with somebody for half an hour, we begin to get to know who they are; we would certainly recognize that we've met them before and begin to know something about their name, and that depends on this system. If it's still there later, when you see the person again you'll be able to do the pattern completion and bring this back to mind. If it's been removed, of course, [00:15:37] and you only recently learned the association, it'll be gone; that accounts for the retrograde amnesia. And a key idea is that while this knowledge is in the connections in the medial temporal lobe,
we occasionally get the chance to replay it, to reactivate it and play it back to the cortical system, so the cortical system gets additional learning trials. From my point of view, [00:16:12] we want to think of those replay events as potentially occurring during things like sleep, and there is good neuroscientific evidence now that replay events occur during sleep; Bruce McNaughton's lab has been central in providing evidence for that. But they also occur when the information is needed again: we see that face again, the hippocampus helps us bring the name back to mind, and that provides a training trial for the cortical network. The cortical network gets the retrieved result (you don't need the person to say their name again, because your hippocampus gave it to you), but you then have a new learning experience in the cortex, and so gradually the cortex builds up this knowledge over repeated opportunities to reinstate these patterns in the cortex, where the changes also occur. So this is the complementary learning systems theory: the connectionist approach together with the idea that there are these two complementary learning systems. [00:17:17] The key points are these: we learn about the general patterns of experiences, not just specific things; gradual learning in the cortex builds implicit semantic and procedural knowledge that forms much of the basis of our cognitive abilities; the hippocampal system complements the cortex by allowing us to learn specific things without interference with existing structured knowledge in the cortex; and in general these systems must be thought of as working together, rather than as alternative sources of information. Going back to the idea from the very first slide: when we're retrieving some person's name, if they're from a culture whose phonological patterns are very unfamiliar to us, that's actually going to be harder than if the phonological patterns are familiar, because in the familiar case the cortical network already has the structure into which just a little bit of hippocampal learning can tweak things to make this particular pattern, this particular Pakistani last name, say, [00:18:20] work out. But if you have no experience whatsoever with the patterns of that language culture, the same amount of hippocampal learning won't be enough. So the general notion is that these systems are always working together. [00:18:44] In the next few minutes I want to give you an intuitive characterization of some of the work I did over an extended period of time, built around these ideas. After this more intuitive part, I'm going to try to introduce you to a more quantitative theory of some of the same phenomena we'll have gone through in the intuitive presentation. [00:19:15] It's a lot to get through in the time available, which is why I may be rushing a little bit up here, but I hope to give you at least a flavor of both aspects of this work. So here goes.
For the next couple of minutes I'm going to talk about how this cortical knowledge gets gradually built up over developmental time. The key point is that distributed representations (what machine learning calls embeddings) that capture aspects of meaning emerge through a gradual learning process; I should really say that the ability to form these distributed representations is what emerges, but we're going to look at the representations themselves as they evolve through this connection-based learning process. [00:20:05] I actually wrote a whole book on the subject with a former PhD student, Tim Rogers. The progression of learning and the representations formed captured many aspects of cognitive development over the first ten or twelve years of life; that was the focus of our book. I only have time to touch on one basic aspect, the progressive differentiation of conceptual knowledge that occurs over middle childhood, so let me briefly describe that as a psychological phenomenon; it's part of the inspiration for this whole theory, actually. [00:20:41] A developmental psychologist named Frank Keil collected, as part of his dissertation research, data on what children thought it was okay or silly to say about things. So: is it okay or silly to say that a man is an hour long? Most kids thought that was silly. [00:21:07] But if you asked, is it okay or silly to say a man is awake: yeah, that's fine; you could be awake, you could be not awake; that's something a man could be, so there's nothing silly about it. If the kid said a predicate was okay, the predicate would end up above the [00:21:28] noun in these trees, and if the kid said it was not okay, they'd be separated. In this way he was able to construct what he called predicability trees. Basically, what you can see here is that as kids get older the trees become more differentiated (these are trees from individual children rather than averages, but they're fairly representative). In particular, younger kids tend to lump all the psychological and physiological predicates together, but older kids differentiate them, reserving some of them only for humans and not for non-human animals. So this is a progressive differentiation process. I was thinking about Frank Keil's work when I heard Dave Rumelhart give a talk at Carnegie Mellon, shortly after I moved there, about a very simple neural network model in which the knowledge is in the connections. [00:22:28] It was offered as a counter to what was then the prevailing view: that we store our semantic knowledge as explicit propositions in a hierarchically structured tree like this one. On the prior view, our knowledge of living things would consist of a set of concept units, each with links for various relations, where the other end of the relation would be [00:22:59] a completion: a pine is green, a tree has bark, a canary is a bird, a canary can fly, a canary can sing. Then there are the ISA links, which allow inference: if you don't know something at one level, you can say, well, a pine is a tree, and a tree has bark, so a pine must have bark. That was the prior theory, and Rumelhart said: let's build a neural network that uses connection-based knowledge to store this kind of information.
[00:23:36] Rumelhart's network consists of input units, one for each of several different concepts, plus other units which you can think of as capturing the relations in that tree; he was specifically interested in a connectionist rendering of the knowledge in that particular tree. On the output side we have units corresponding to the completions of propositions involving the item and the relation, or the item in the context, if you like. In this case, for example, a robin can grow, move, and fly, so the network is trained with inputs like 'robin' plus 'can', and it's taught that the correct set of possible completions is grow, move, and fly. That's one training example; for Dave Rumelhart, the training examples were all the things that are true of these items, integrating across the levels of the tree. [00:24:38] There is a total of just 32 training examples, one for each combination of item and context, and he trained the network with interleaved learning. So look at the patterns of activation at the representation layer (these are fully connected weight matrices, by the way, in case you're wondering; I didn't draw all the connections, but the layers are fully connected): [00:25:12] they gradually evolve as the system learns, and we can look at them as histograms of the activations of the units at different points in learning. An epoch is a sweep through all the examples, one set of interleaved experiences, basically, and that's how we train neural networks: we use interleaved learning all the time. [00:25:34] Early on the network hasn't really learned anything; the initial representations, even after 50 epochs, are basically determined by the initial random weights, so they don't capture the structure. But if you look at 150 epochs, you can see that the patterns have differentiated quite a lot, so that now the two fish are quite different from the two birds and vice versa, [00:26:03] and the animals are all quite different from the plants. There's been less differentiation among the plants, but still, the trees are more similar to each other than they are to the flowers and vice versa. And if we use hierarchical clustering analysis to look at this, what we see is that early on there's really no differentiation, and things aren't grouped appropriately, just due to the random weights; at an intermediate point the network has separated the plants from the animals but not further differentiated things; and later still it has more fully differentiated the entire structure of the domain. So this is a phenomenon called progressive differentiation, and it results from [00:26:52] interleaved learning, and it captures this aspect of development. Like I say, Tim and I wrote a whole book about this; I don't have time to say more about it, but we captured a lot of aspects of the development of children's conceptual knowledge of living things and [00:27:07] other kinds of things in that work. So I was quite convinced that it's reasonable to keep thinking of this gradual, cortex-like learning process as the one that characterizes the knowledge that's intact after the hippocampal system is removed. But what happens in this system if we try to learn something new? In our '95 paper I considered the case of a penguin.
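To make the setup concrete, here is a minimal sketch of a Rumelhart-style network, linearized for simplicity; the target propositions are random placeholders standing in for the real 32 training examples, and in the original model the relation input joins at a second hidden layer, which this sketch simplifies away.

```python
# A minimal sketch of a Rumelhart-style semantic network, linearized and
# trained by plain gradient descent on squared error. Targets below are
# random placeholders, not the actual 32 propositions.
import numpy as np

rng = np.random.default_rng(0)
items = ["robin", "canary", "salmon", "sunfish", "pine", "oak", "rose", "daisy"]
relations = ["isa", "is", "can", "has"]
n_item, n_rel, n_hid, n_attr = len(items), len(relations), 8, 30

def one_hot(i, n):
    v = np.zeros(n); v[i] = 1.0
    return v

W1 = rng.normal(0, 0.05, (n_hid, n_item + n_rel))  # input -> representation
W2 = rng.normal(0, 0.05, (n_attr, n_hid))          # representation -> attributes
targets = (rng.random((n_item, n_rel, n_attr)) < 0.3).astype(float)

lr = 0.1
for _ in range(500):                  # interleaved learning: sweep everything
    for i in range(n_item):
        for r in range(n_rel):
            x = np.concatenate([one_hot(i, n_item), one_hot(r, n_rel)])
            h = W1 @ x
            err = W2 @ h - targets[i, r]        # gradient of 0.5*||error||^2
            g2, g1 = np.outer(err, h), np.outer(W2.T @ err, x)
            W2 -= lr * g2
            W1 -= lr * g1

# The item representations whose progressive differentiation the talk
# describes are the hidden patterns evoked by each item alone:
reps = [W1[:, :n_item] @ one_hot(i, n_item) for i in range(n_item)]
```

Running a hierarchical clustering analysis on `reps` at different epochs is what produces the progressive differentiation pictures described above.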
[00:27:39] So what did I do to explore this? I simply took the network that had already been trained on these eight items (it already knew about the two birds, the two fish, the two trees, and the two flowers) and added a new neuron as a new input unit. I tend to think of the inputs as [00:28:02] sparse distributed patterns that are sufficiently orthogonal to each other that we're not relying on any similarity structure, and one-hot input units capture that same idea and are simpler to implement, so that's what we did. Then I trained the network with two patterns. For 'penguin' with 'isa', [00:28:30] the correct output specified was living thing, animal, bird (penguins are technically birds); for the other pattern, 'penguin' with 'can', it was trained with grow, move, swim. This was the entirety of the network's new experience, two patterns, one penguin-with-isa and one penguin-with-can, and I just tried to teach it. So what happened? It learned this new information quite rapidly. [00:29:04] This is a measure of the absolute error, averaged over the output units across the two trained items, as a function of repetitions of those items. But the consequence was interference with all the existing memories of the other birds and the other fish. An earlier demonstration is actually where the word 'catastrophic' in the title comes from: that neural networks show catastrophic interference had previously been demonstrated by [00:29:40] Michael McCloskey, a psychologist at Johns Hopkins, and Neal Cohen. They showed that if you trained a network on one set of associations and then tried to teach it a different set of associations, you could completely wipe out its ability to produce the first set before it even started to learn the second. [00:30:03] This finding helps us understand why we have these complementary learning systems. We do need to learn new things quickly, but if we tried to learn them quickly in our neocortical networks, it would produce a lot of this interference. The reason interleaved learning is effective is that it lets the network use the gradient over the entire set of alternatives, the whole environment essentially, to find a pathway through the space of connection weight adjustments that eventually carves out a place for the new item; but this depends on interleaved learning. So we need complementary systems to allow rapid learning of new things without catastrophic interference with what we already know. That was the point our '95 paper argued. And we said, OK, look: the neocortex stores all these different kinds of information about things, and the hippocampus can see the states throughout all of this and [00:31:24] has projections from them that allow rapid learning to occur within the hippocampus; once that has occurred, replay events can interleave exposure to the new item with ongoing experience of other things. I then built another experiment in which I didn't actually implement the hippocampus; I just added the two new items to the training set and continued the interleaved learning, as if there were a hippocampus helping with that.
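Here is a hedged sketch of the contrast just described, with random placeholder targets rather than the real penguin features: focused training on the new item alone is fast but damages old knowledge through the shared weights, while interleaving preserves it.

```python
# Hedged sketch of focused vs. interleaved learning of a new item in a
# linear two-layer toy network (placeholder data, not the real dataset).
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 9, 12, 24          # 8 old one-hot items + 1 new unit

X_old = np.eye(n_in)[:8]
Y_old = (rng.random((8, n_out)) < 0.3).astype(float)
x_new, y_new = np.eye(n_in)[8], (rng.random(n_out) < 0.3).astype(float)

def step(W1, W2, x, t, lr=0.1):
    h = W1 @ x
    err = W2 @ h - t
    return W1 - lr * np.outer(W2.T @ err, x), W2 - lr * np.outer(err, h)

def old_item_error(W1, W2):
    return np.mean((X_old @ W1.T @ W2.T - Y_old) ** 2)

W1 = rng.normal(0, .05, (n_hid, n_in))
W2 = rng.normal(0, .05, (n_out, n_hid))
for _ in range(2000):                    # pre-train on the old items
    for x, t in zip(X_old, Y_old):
        W1, W2 = step(W1, W2, x, t)

# (a) Focused learning of the new item: fast, but the shared output
# weights shift, damaging old knowledge (catastrophic interference).
Fa1, Fa2 = W1.copy(), W2.copy()
for _ in range(100):
    Fa1, Fa2 = step(Fa1, Fa2, x_new, y_new)

# (b) Interleaved learning: the new item is mixed into ongoing sweeps
# through everything else; slower, but old knowledge is preserved.
Ia1, Ia2 = W1.copy(), W2.copy()
for _ in range(100):
    for x, t in list(zip(X_old, Y_old)) + [(x_new, y_new)]:
        Ia1, Ia2 = step(Ia1, Ia2, x, t)

print("old-item error, focused:    ", old_item_error(Fa1, Fa2))
print("old-item error, interleaved:", old_item_error(Ia1, Ia2))
```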
In this set of results, the network is learning about the penguin interleaved with ongoing exposure to all the other things in the environment. It actually needs many more repetitions of the penguin information to integrate it into this cortex-like network, but the advantage is that it produces hardly any interference at all with what was already known. [00:32:31] You might say, gee, you never fully integrated it, and we'll come back to that a little later in the talk, but that's the basic idea: initial storage in the hippocampus, followed by repeated replay interleaved with ongoing repetitions of other things, leads to the consolidation of new learning in the cortex while avoiding catastrophic interference. [00:32:54] All right. However, I'm going to run out of time, so I'm going to tell you about this next part really briefly. It's a beautiful story, and Richard Morris deserves a huge amount of credit, but because I want to tell you about the quantitative theory I'm going to have to give it somewhat short shrift, with apologies to Richard. Richard's contribution to this debate was to say: look, sometimes we can consolidate new things quickly. I know your community here includes people from psychology but also people who study learning in rodents, and this is my opportunity to celebrate that, because this work was done in rodents; McNaughton is of course a rodent researcher, and this whole theoretical enterprise has been built by connecting the human literature to the rodent-based literature. So what did Richard do? [00:33:56] Hopefully I can convey this briefly: he trained animals on a flavor-place association task. What happened was this. The animal was placed in a start box, which could be on any one of the four walls of the arena (this is the actual arena, and this is the schematic of it). [00:34:18] In the start box the animal was given a tiny taste of a flavor, let's say strawberry, and then it was let out into the arena and had to find the hole in the floor that held three bigger pellets of strawberry-flavored food. The animals learned this very gradually, over many, many sessions. [00:34:45] That was the early learning: all six of these trials occurred every day, with six flavor-place associations. After this was learned, Richard switched things around so that four of the associations stayed the same, but he replaced two of them with two new ones, at locations very close to the previous ones; I'm mentioning that purposefully. Now let's say the new flavors are hot dog and beer; [00:35:20] I don't know if they actually had those pellets available, but that's what I found on the web. The new locations were close to the previous ones. I'm not going to go through the graphs, but what he found was that after just one trial with these two new associations, he could lesion the hippocampus 48 hours later, wait for the animal to recover, and the animal would perform just as well on that one-trial learning of the new associations as it would have with no lesion. So not only did the animals learn in one trial,
[00:36:04] but if the learning depended on the hippocampus, it only depended on it briefly, because you could take the hippocampus out 48 hours later and they still had the information. They couldn't learn new associations after that, whereas the animals without lesions could; so the hippocampus had to be there for the initial learning, but not for very long. Subsequent research (which I can't go into either) showed that a lot of gene induction occurred in the cortex [00:36:37] even in the first 80 minutes after exposure, if the new information was given in the familiar environment; but if it was given in a novel environment the animal had never experienced before, there was basically no learning in the cortex. These are two amazing Science papers that showed all this; again, I've given them a very brief [00:37:02] presentation here. Richard actually held this up as a refutation of the complementary learning systems theory, because it seemed to indicate that the cortex can in fact learn new things rapidly, and we had said cortical learning was a slow process. So I realized that I had miscommunicated about my theory, [00:37:23] and I went back and thought more about this, and I engaged with Richard's ideas within the context of the example we've already seen. The basic question is what it means for something to be consistent with what we already know. Here, the new knowledge is quite consistent, because it involves places the animal already knows about, in a very familiar environment; [00:38:03] in the context of our knowledge of living things, you can think of it as a new item you can add into the neural network without disrupting the structure it already has. The penguin isn't like that, because it's not quite a bird and it's not quite a fish; it's sort of halfway in between, so to get it in there you have to carve out a new category within the animals. But if you try to learn about a new trout, which is just like any other fish, or a new bird that's like any other bird, that would be schema-consistent: in this diagram, you just add it in under bird or fish. So I explored this, again briefly, using the same approach as with the penguin, only this time, in addition to replicating the penguin experiment, I ran experiments with a trout and with a cardinal. [00:39:06] So there were three separate experiments; I'm going to average over the trout and the cardinal because they produced very similar results. What did I find? The first thing was that the initial learning, even focused learning without any interleaving, was very rapid for the schema-consistent items, even though the input is completely arbitrary in some sense: the new input unit is orthogonal to everything else. You have to learn that it's a bird or that it's a fish, but that's easy to learn, because it drops straight in; it fits the structure. [00:39:38] The penguin is much harder to learn. And the new item doesn't produce much interference if it's fully consistent, whereas to get the penguin in there without interference, we had to interleave. So that's the basic result: if we interleave, of course, we can avoid the interference with the penguin, and we don't even really need interleaving very much
[00:40:05] with the things that are already consistent; a few repetitions would be enough, and there wouldn't be much interference. OK, so this is the part that I'm going to just skip. (Fifteen minutes? Good.) OK, so I'll be able to tell you about the theoretical part in my remaining fifteen minutes. [00:40:27] There were three things I had hoped to touch on; here are the two questions I'll have time for. Are all aspects of new learning integrated into cortex-like networks at the same rate? And is it possible to avoid replaying everything one already knows when one wants to learn new things? I think these launch the opportunity for a lot of ongoing exploration; I probably won't be able to say much about that, but you'll get these two ideas, and maybe they'll inspire you to follow up. To make progress we need [00:41:06] an explicit mathematical theory, and this is where Surya Ganguli came in. He, Andrew Saxe (then a PhD student), and I have a paper that will appear in PNAS soon, in which we lay out the first three elements here; the extension to what happens when we learn an additional item is work I've done building on what we started together, and the basic theory was developed primarily by Surya and Andrew. First we characterize the structure in the dataset to be learned, [00:41:50] then we consider a network that can learn that structure, then we analyze the dynamics of learning, and that's where we'll be able to answer these questions. So here's the dataset. It's pretty much the same as before; we now have a sparrow and a hawk, for reasons that will become apparent later, but otherwise the items are the same. Think of what you're looking at as an item-property table, where black means the item has that property and white means it doesn't. So these are four properties that all animals share, these are two properties that birds have and fish don't, these are two properties that fish have and birds don't, these are two unique properties of the sparrow (maybe it's small and meek), and these are two properties of the hawk (it's big and fierce). [00:42:43] Then these are unique, individuating features over here, and the plants are over on the other side, with non-overlapping features. So this is just a dataset, an item-property matrix, and what we're going to do is a singular value decomposition of it. What does that mean? We ask ourselves: what are the dimensions of this data? If I could split the items on just one dimension, how much of what's in the matrix could I account for? And the first dimension of the singular value decomposition is indeed something that says, in this case: is this an animal? If it is an animal (and there are four animals), it tells you the item is likely to have these properties and less likely to have those other ones.
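Here is a toy stand-in for that item-property matrix and its SVD; the property assignments are my own guesses at the block structure described, not the actual dataset.

```python
# A hypothetical item-property matrix with the hierarchical block
# structure described in the talk, and its singular value decomposition.
import numpy as np

items = ["sparrow", "hawk", "salmon", "sunfish", "oak", "pine", "rose", "daisy"]

def shared(members, width):
    """`width` property columns shared by exactly these items."""
    col = np.array([1.0 if it in members else 0.0 for it in items])
    return np.tile(col[:, None], (1, width))

P = np.hstack([
    shared(items[:4], 4),                       # properties all animals share
    shared(items[:2], 2),                       # bird properties
    shared(items[2:4], 2),                      # fish properties
    np.vstack([np.eye(4), np.zeros((4, 4))]),   # unique animal features
    shared(items[4:], 3),                       # properties all plants share
    shared(items[4:6], 2),                      # tree properties
    shared(items[6:], 2),                       # flower properties
    np.vstack([np.zeros((4, 4)), np.eye(4)]),   # unique plant features
])

U, S, Vt = np.linalg.svd(P, full_matrices=False)
print(np.round(S, 2))        # one dominant mode per hierarchical split
print(np.round(U[:, 0], 2))  # leading mode: the animal category (zero on plants)
```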
[00:43:50] If you take the product of this classifying variable times the strength factor times the feature vector, you get the picture here, which is the content of the dataset explained by this dimension. The singular value decomposition has eight dimensions: one for the animals, one for the plants, one that differentiates the birds from the fish, one that differentiates the trees from the flowers, then one that differentiates the two birds from each other, one for the two fish, one for the two trees, and one for the two flowers. [00:44:35] So these are the dimensions of the dataset, and what you saw before, in the progressive differentiation simulation, was the neural network picking up on the first two dimensions early in learning; later it picked up on the next two, and it was beginning to show weak sensitivities to the rest when the simulation stopped. [00:45:04] But that's the structure in the dataset; that's just about the data. Why does the neural network exhibit sensitivity to this structure? This is where the work with Surya comes in. Together we have analyzed (I should say, they have analyzed) deep linear networks, with the minimal two layers of weights: input to hidden, and hidden to output. [00:45:33] The first observation, which I'll make extremely briefly, is that once you have at least two layers of weights, the learning in any one set of weights depends on what's already known in the other set. This is the reason knowledge is essentially [00:45:57] built up gradually over developmental time: there are these mutual dependencies, so you have to start learning in one set of weights before you can learn in the other, and of course you're learning in both of them in parallel; it's a kind of bootstrapping process. [00:46:17] We're going to look at this in a simplified version of the Rumelhart network: instead of having the items and the contexts as separate inputs, we flatten the output across the whole thing and simply ask, for each item, what are its features. So this is our neural network: a linearized network with two fully connected layers of weights. The dynamics of learning are as follows, and you can see them here: [00:46:48] the network essentially learns each dimension on its own timescale, following a sigmoidal approach to asymptote, where the asymptotic strength of each dimension is just the singular value of that dimension; the strongest ones end up higher on the curve for that reason (these are just the singular values, plotted over the course of learning). But the dimensions kick in at times that depend on their strength, and the dependence shows up in several ways: the time until takeoff, the steepness of the acquisition, and the asymptotic level all depend on this one singular value parameter. [00:47:32] So this explains the progressive differentiation phenomenon, and many other phenomena that Tim and I covered in our book. Now we can get back to learning something new. This time it's going to be a sparrowhawk. I picked this because it's within the bird domain, but it crosses up the features of the sparrow and the hawk: instead of being small and meek or large and fierce, it's small and fierce, [00:48:05] and it's got its own unique distinguishing feature here as well, which is new.
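For reference, here is that sigmoidal trajectory in code. The closed form is my reading of the Saxe, McClelland, and Ganguli analysis for a single mode with singular value s; treat its details as an assumption.

```python
# Sketch of the sigmoidal learning trajectory for one mode of a deep
# linear network (closed form as I recall it; treat as an assumption).
import numpy as np

def mode_strength(t, s, a0=1e-3, tau=1.0):
    """Effective strength at time t of a mode with singular value s,
    starting from a small initial strength a0; rises from a0 to s."""
    e = np.exp(2.0 * s * t / tau)
    return s * e / (e - 1.0 + s / a0)

t = np.linspace(0.0, 10.0, 400)
for s in (4.0, 2.0, 1.0):                   # stronger modes take off first
    a = mode_strength(t, s)
    takeoff = t[np.argmax(a > s / 2.0)]     # time to reach half strength
    print(f"singular value {s}: half strength near t = {takeoff:.2f}")
```

The takeoff time scales roughly like (tau / 2s) * ln(s / a0), which is why the time to takeoff, the steepness, and the asymptote all depend on the one singular value parameter.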
We can now do a singular value decomposition of the complete dataset, done from scratch with the new item included, and what you should be able to see is that the new item [00:48:29] becomes part of the category of animals, and of birds rather than fish, and also adds its own unique new dimension. It colors the animal-plant dimension, because it contributes to the overall statistics of animals; it colors the bird-fish dimension, because it contributes to the overall statistics of birds; but it also requires a unique dimension to differentiate it from the other things in the network. The singular value decomposition would ordinarily rank this new dimension above the lowest two, but for the sake of comparison I've artificially sorted it to the bottom of the list; you can see its strength is actually a little higher than theirs. So that's just the SVD. What happens when we try to learn this with interleaved learning? We see something like this: [00:49:38] very rapidly, the network learns a lot about the sparrowhawk, but then it hits a plateau, and while it's in that plateau there's interference with the birds, not so much with the fish or the plants. This is similar to what we saw before with the penguin, though I didn't analyze that case in as much detail. [00:50:07] If we look at the singular value decomposition over the course of learning, we see that right away the network adapts the strongest dimension to capture the new overall statistics, and the bird-fish dimension is also adapted to capture the new statistics. But the plateau is associated with a period when the new dimension hasn't yet been integrated into the system: the network still has to learn that dimension on the same slow timescale as if it had been there from scratch, and it has to have interleaved learning with the other relevant items to learn it. So the point of this part of the talk is that [00:51:04] we can learn some aspects of new things really fast, but not the really arbitrary new stuff that specifically conflicts with other things; the consistent aspects we can integrate into the cortical network quickly. Back to Richard's experiment: he put those two new flavors right near the old holes in the familiar environment. I think of that as just mapping them onto old structure, and that's easy; it's just [00:51:32] mapping into an already learned space. Developing a new dimension in your space is harder. So in the next simulation, what I did was simply freeze the weights from the hidden layer to the output side, so that the only thing the network could do was adjust the weights coming from the sparrowhawk into the hidden layer, for the sake of seeing what happens. (I ran it to this point, stopped it, and then continued after that, but we're going to look at the results of analyzing it at that point.)
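A hedged sketch of that frozen-readout manipulation, continuing the toy setup of the earlier sketches (here the pre-training is only indicated by a comment, and the targets are placeholders):

```python
# Hedged sketch: freeze hidden->output weights and let only the new
# item's input weights adapt (toy placeholder data).
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid, n_out = 9, 12, 24
W1 = rng.normal(0, .05, (n_hid, n_in))
W2 = rng.normal(0, .05, (n_out, n_hid))
# Assume W1, W2 were pre-trained on the 8 old items as in the earlier
# sketches, so W2 encodes the learned feature structure.

x_new = np.eye(n_in)[8]                        # one-hot "sparrowhawk" unit
y_new = (rng.random(n_out) < 0.3).astype(float)

lr = 0.1
for _ in range(200):
    h = W1 @ x_new
    err = W2 @ h - y_new
    # Only W1's column for the new item changes; W2 stays frozen, so the
    # old items' outputs are untouched, and the new item can only be
    # projected into the structure W2 already represents.
    W1 -= lr * np.outer(W2.T @ err, x_new)

# The best achievable output is the projection of y_new onto W2's column
# space: category-level structure is captured, but the idiosyncratic new
# feature dimension is not.
print(np.round(W2 @ (W1 @ x_new), 2))
```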
[00:52:13] What you can see is that the network's outputs for all the existing things are perfect, just as they were at the end of learning about them, and what it has learned about the sparrowhawk is basically to say: yes, it's an animal, it's a bird, and within the birds I'm confused about which properties it actually has; it looks like the average of all the birds. That's what the network can learn just by learning to project the new item into the existing structure, and that's easy. [00:52:42] So maybe our cortical systems can learn that kind of new structure fast, but they can't learn the other aspect as quickly. OK, the last point I want to make (and I realize I'm actually going to be able to make it, so that's good) is about how we might increase the efficiency of the interleaved learning process. This is [00:53:20] something Bruce McNaughton is really responsible for: he asked me about similarity-weighted interleaved learning, and I started thinking about all these issues again. Here's the idea. Since the interference in the full interleaving condition was really only occurring with the birds anyway, maybe all we have to do is interleave the birds and not worry about everything else. Well, it turns out that if you don't interleave anything from the fish it doesn't work so well, but with a little bit of interleaving of the fish you don't have to worry about the plants at all. So we get a savings: virtually the same results with three tenths as much interleaving, because you're interleaving only the birds, a little of the fish, and none of the plants. The learning curves look exactly the same, but the amount of interleaving required is much smaller. [00:54:15] And if you compare this to a uniform interleaving condition, where I interleaved everything with probability 0.3, the whole thing gets stretched out and takes longer to learn (this is the total sum of squared error; you can see it's stretched out). [00:54:37] It gets to pretty much the same point after the same total amount of exposure as full interleaving, but you're wasting all those presentations of the plants and so on. So, to summarize: by focusing the interleaving on items similar to the new item, we can learn a new item with far less replay of older memories. Unfortunately, I have to say this doesn't always work: we've tried it in larger-scale deep learning architectures, on various other tasks, and the similarity weighting doesn't necessarily produce a big benefit there, so there's a lot more work to do. But [00:55:09] I know I should stop, so I'll just show you my last slide to wrap up. These are the questions I asked on the earlier slide. Are all aspects of new learning integrated into cortex-like networks at the same rate? No, not in the networks we studied: some aspects can be integrated [00:55:39] quickly, but some are integrated much more slowly than others, and this makes predictions for experiments; we could actually test them in variants of Richard's experiments or other paradigms. The other question: is it possible to avoid replaying everything one already knows? Yes, at least in some circumstances, but further research is required to understand this better. OK, thanks.
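Before the questions, here is a hedged sketch of the similarity-weighted replay schedule just summarized; the similarity values and function names are illustrative placeholders, not the actual simulation code.

```python
# Hedged sketch of similarity-weighted interleaved replay: old items are
# replayed with probability proportional to their similarity to the new
# item, rather than uniformly.
import numpy as np

rng = np.random.default_rng(3)

def replay_schedule(old_items, new_item, sims, n_epochs):
    """Yield training batches: the new item every epoch, old items
    sampled by similarity. sims[i] is old item i's similarity to the
    new item."""
    p = np.asarray(sims, dtype=float)
    p = p / p.max()                 # the most similar items always replayed
    for _ in range(n_epochs):
        batch = [new_item]
        for item, prob in zip(old_items, p):
            if rng.random() < prob:  # birds often, fish sometimes, plants ~never
                batch.append(item)
        rng.shuffle(batch)
        yield batch

# Example: 8 old items; similarity ~1 for birds, ~0.3 for fish, ~0 for plants.
sims = [1.0, 1.0, 0.3, 0.3, 0.01, 0.01, 0.01, 0.01]
for batch in replay_schedule(list(range(8)), 8, sims, 3):
    print(batch)
```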
[00:56:28] [Audience question, largely inaudible.] >> So, I can answer that a little bit by describing things like the following. In a language prediction setting, we hold out a few words from the training set and train the RNN; then [00:57:11] we also train it with those words included, and we find words that are similar to the held-out words. So those are the words we hold out: we hold out a set of words, and we have other words that we know would have had similar representations if everything had been included all at once, and then we try to do the similarity-weighted interleaved learning on [00:57:29] the set of new words the network hasn't seen before, interleaving the words that we know should have similar representations, the ones that were in the training set. In that context, it did not work better overall than just random interleaving, so we're not entirely sure what's going on there yet, [00:57:52] and more work needs to be done; that's sort of my last point. But in that case, you've got nothing but a one-hot code for the word on the input and on the output of this RNN that's doing language prediction: the input is a one-hot code, and the output is also one-hot (which word do I think is going to occur here?), [00:58:16] and it didn't work in that context, in the experiments we've tried so far. I think this is where I'm hoping neuroscience will actually come to our rescue and teach us something about how we ought to be representing things in deep learning, but this is wildly speculative. [00:58:44] [Follow-up question.] >> I guess I'm thinking that the patterns artificial neural networks are using aren't sufficiently orthogonalized, so there's just a lot of overlap across all the patterns. As for the hierarchical structure, I don't know much about what that structure looks like in those models. If the SVD were to have cross-cutting structure, structure that picks up on features across all the items, then there wouldn't be anything you could narrowly focus on, because everything would overlap. Take a gender feature, for example: a gender feature is going to be [00:59:30] showing up in everything. The plants' genders are not so important, but just for the sake of discussion let's imagine the feature crossed the whole space. Well then, if I trained on a new item that involved gender, there would be some cross-cutting engagement with everything in the entire space, and that would [00:59:51] tend to reduce the extent to which the similarity weighting could be focused. So we need to explore the structure of the dataset itself: what is the structure in those datasets? That's very hard in deep learning; in these little toy datasets we can specify exactly what it is. [01:00:15] But that's part of the answer to that question. Yeah, there was another question? OK, yeah, in the back. [Audience question, largely inaudible, about losing a first language when immersed in a new one.] >> Yeah, so lots of thoughts come up about that. If you take somebody from a culture and drop them into a brand new culture, and you isolate them from any speakers of their preexisting language, [01:01:00] and you do this at an early enough age, they are going to suffer loss of that original language to a very extensive degree. Now, in fact, very early on, in a chapter in the PDP books, Geoff Hinton pointed out that even if you do get
[01:01:23] an apparent loss of the ability to produce particular outputs for given inputs, there's usually still some latent trace of it in the network, so it can be recovered a little faster than you might think; that's probably true in our models as well, though we haven't really focused on it. But setting that point aside: [01:01:47] the more immersion you have in the new culture, the more you're going to get that effect of rapidly learning the new material, at whatever cost to what you previously knew. Some people's theories of second language learning are basically exactly this: [01:02:10] the difference between the kid and the grandma is that the kid is thrown into the culture, has to go to school every day, and is immersed in learning the new language, while grandma stays at home and only talks to mom and dad, who go out and do the shopping; so grandma gets a little exposure to English, but not anywhere near as much as the kid. And I think the whole business about critical periods is largely about the interleaving. There's also an entrenchment effect: [01:02:33] once the weights get really structured in certain ways, it's harder to adjust them in new directions. So yes, there's much more to learn about the details of all that.