Yeah, it's really nice to be here with you all today. This is the first time I've given one of these departmental or university-level seminars since the pandemic hit, actually with a full day of meetings and stuff like that. I used to do that a lot back in the old days, but this is the first time since.

Today, I'm going to walk you through a bit of my own personal journey over the last few years on questions of how best to analyze neural data. Over the last six years, my group has been involved in a lot of different things. I describe my group broadly as being interested in general principles of intelligence that apply equally to artificial and natural systems. But one of the things we have to do if we want to understand general principles of intelligence, be they in a natural or artificial system, is to actually be able to analyze those systems, make statements about them, test hypotheses with them, and of course do downstream stuff with them. And for that, you really need good machine learning tools, as I'm going to argue today.

I'm going to start out with what I view as the ultimate goal with ML models for neural data. I'll try to leave some time at the end of the talk because I really do want to open up a philosophical can of worms with some questions here, because I think there are some critical questions for us as a community to attend to. Then I'm going to talk about some of the first steps that we engaged in in our lab to work specifically with calcium imaging data, which was a personal pet project of mine because I had done some calcium imaging work in my PhD and found the way we analyzed it to be immensely frustrating. I'll then tell you about the change of heart I had. Ultimately, our work on analyzing calcium data hit a roadblock about two years ago, and I was feeling really dissatisfied with the whole endeavor. I'm going to be very open with you about that dissatisfaction. But something changed for me in recent years, thanks in large part to meeting Eva Dyer and her group here at Georgia Tech, and also to some broader conversations I've been having with people at Mila, the research institute where I work. That led me to a different way of thinking about it, and to thinking that, in fact, the key thing we have to think about for analyzing neural data is how we tokenize it. I'll explain why. Then lastly, I'll articulate what I view as the end goal for us, what I'd like to see this type of work achieve in the next five to ten years for neuroscience more broadly.

Let's start with the need for machine learning methods in neuroscience. It's worth noting that we've received a lot of money over the last few years. If you tally up the total amount of money spent by the NIH, various European agencies, Japanese agencies, Canadian agencies, it adds up to over $100 billion in investments in neuroscience over the last ten years. That's a lot of money to spend on a problem. The result has been that we've, of course, learned a lot about the brain and we've developed wonderful technologies for recording brain activity. We're at a point in recording brain activity that I only dreamed of as a graduate student not more than 18 years ago: being able to record from tens of thousands, or even a million, neurons at once in an awake, behaving animal. Wild stuff. But what are we actually going to use all this data for? I think there's sometimes a tendency in the neuroscience community to engage in a bit of what I call underpants-gnomes thinking.
Most of you are probably too young for a South Park reference, but for those of you who are old enough, I really do think sometimes we in neuroscience think like the underpants gnomes from South Park. In that episode, the underpants gnomes' plan is: we steal all the underpants, then something happens, then profit. In neuroscience, there's a little bit of this: we collect all the data, then something happens, then good stuff, trust us. But what happens in between collecting all the data and the good stuff is rarely articulated and, I think, actually poorly understood in the neuroscience community.

Now, let's at least try to articulate the endgame. What could we do, theoretically, if we had access to all this neural data and ways of analyzing it, and that "something happens" step worked out? Well, there are numerous potential applications of having good neural activity recording technologies. You could, of course, have brain-computer interfaces for medical purposes, to help people who are paralyzed, but maybe even for non-medical purposes, like helping operators of large machinery, or, if we get really sci-fi, letting all of us control our personal electronics. Drug efficacy prediction is a huge thing. People react very differently to psychiatric drugs, and we have no way of detecting right now who's going to respond well to a given drug. Currently it's just trial and error: doctors throw drugs at people and keep doing that until one of them sticks. But if you could record neural activity and say, oh, you've got this neural dynamics type, you're going to respond well to this drug, boom, great, you cut out three to four years of misery in a patient's life. Epilepsy treatment: we are still cutting out pieces of the brain to treat epilepsy. It's nuts. But theoretically, if you could record neural activity, understand where exactly the dynamics are going wrong, and regulate that with stimulation, it would be immensely helpful. I could go on and on. I think the core point is that we see many possible endgames. Psychiatric diagnosis would be really nice: if someone's depressed, to actually be able to say, yeah, you've got a real issue with your underlying neural dynamics, versus, you just need to change things in your life. Those are two very different treatment paths forward that theoretically we could identify with neural activity. So these are all the great endgames. This is where we're going to get to once that "some stuff happens in between" works out.

The problem, and this is honestly something that keeps me up at night, is that there's this question, and this is the philosophical can of worms I'm here to open up today: is neural data even interpretable? Many of us would like to be able to interpret our neural data. This is a word you hear used constantly. At the last COSYNE, there was a really fun little moment of tension when Mackenzie Mathis gave her talk on CEBRA, the model her group has built, and David Sussillo stood up and said: my question for you is, you keep saying this renders the neural data interpretable; what does that mean? And Mackenzie answered, but it wasn't really clear that there was an answer to it. I think the reason is that, intuitively, what all of us mean by interpretable is something like this.
I would argue that interpretability means I could get to the point where, using either human language or a very small set of equations, I could explain to you the algorithms that are in operation in the brain. That's roughly what we mean by interpretable. It's very obvious why you'd want that, but the question is: is that even possible with neural data? There's an assumption we make that it's possible, that we could get to the point where I could describe to you, with human language, how this brain region works, how you successfully pick up a cup, or decide to eat this food, or decide which assignment to do for your class, et cetera. But there's also the distinct possibility that it's just not interpretable. Evolution had no reason to produce neural dynamics that were capable of being interpreted by the human brain; that was not something that was selected for. It could very well be that the algorithms being used by the brain are sufficiently high-dimensional that we can never interpret them ourselves, and we could never explain to someone else how they work. That might just be the reality we have to live with. If that's the case, it means that the in-between part, where we go from collecting all the data to solving all the problems, can't necessarily involve us understanding things. I know that sounds really alarming to neuroscientists, but I'm putting it out there as the possibility that maybe we have to accept: the way we're going to get to all those wonderful final endpoints is not necessarily by fully understanding how the brain works, but by allowing machine learning tools to understand how the brain works for us, to identify those high-dimensional algorithms for us, and to do the heavy lifting of figuring out what the response of a person to a drug will be, how to stimulate their brain to push it away from an epileptic seizure, how to determine what they're trying to think of when they're trying to control a computer, et cetera. That might just be where we end up. I'll be honest: I think that's where we're going to find ourselves. I've become more and more convinced over time that that's just the reality we face as neuroscientists. It might not be satisfying in the same way that it would be satisfying to have a nice verbal description of how the brain works, but it might just be our lot.

Actually, I'm going to skip this slide; I've already made this point. Machine learning can help us to identify these relevant patterns in neural activity. It's done it for a lot of other kinds of data, to be clear; that's the promise here. It's very hard for human beings to understand things like protein folding, but it turns out ML systems are perfectly capable of handling protein folding. Why not neural activity? That brings me to the point where I've set up what the goal is: I want to see machine learning tools that help bridge that gap of understanding that we might be incapable of possessing ourselves.

When I first entered into this mindset six years ago, which is what I'm going to talk about now, it was about using deep, unsupervised techniques to extract latent information. Six years ago I started thinking about this thanks to Chethan's work. Then about four years ago, we started working on it for calcium data, and that's where I'm going to bring you now. So, okay. Indeed, Chethan and David's work was a real game changer for me mentally. I remember it was COSYNE 2018, and I saw David sitting there chatting with Tim Lillicrap.
It was the opening night and I came up to them and was like, oh guys, what's going on? Let's go party. And David was like, I'm having a very serious conversation here; I'm explaining something to Tim. I was like, okay, okay. So I sat down and listened to what he was explaining to Tim, and he was explaining LFADS to Tim: latent factor analysis via dynamical systems. I'm not going to go on at length about this, because probably a lot of people here are familiar with it, but I want to describe the core idea because it ended up being the underpinning for our work for a long time. Like I said, for me it was a real mental shift. And if at any point I say something that you disagree with, please feel free to interject.

The shift for me, as I was sitting there listening to David explain this system to Tim, was this: he said, look, you get your spikes, and they're just the expression of a small sample of cells, like a snapshot, a slice through this computational machinery. But there should be some underlying neural dynamics that are the core computation actually being performed by the circuit. What we would want to do, from recording this neural spiking data, is somehow extract what those underlying dynamics are: what is the actual computation being performed that these spikes are just giving us a tiny little window onto? The way these guys decided to tackle this problem is using a deep variational autoencoder. The core idea is that you're going to feed your spikes into a deep neural network, and the deep neural network's job is going to be to recreate those spikes according to Bayes-optimal principles, courtesy of variational autoencoder math. The idea is that you have an encoding RNN and a generator RNN, where the generator RNN is going to spit out for you, in a series of factors, the underlying core dynamics that the system is ultimately engaged in computing. So it runs your data through the encoder RNN, which spits out a set of initial states; the generator RNN then spits out the dynamics; you get your latent factors; those get turned into firing rates for your neurons; and then you compare to the actual spikes and see, according to a Poisson distribution, whether you are predicting the spikes correctly with this model.

Now, why would this possibly work? Well, it's because variational autoencoders are designed to be a technique for inferring the latent variables that are actually responsible for generating the data that you receive. There are some tricks in variational autoencoders that I haven't talked about, but they're not worth going into in this talk. The point is that if you run data through a system like this, theoretically what you're going to get out is an identification of the underlying latent process that generated that data, and that should be the dynamics of your underlying computation. And indeed, it did seem to work. LFADS, at its core, was able to pick out some really clear dynamics. Possibly the easiest way to see that was with its ability to decode things. Now, this is going to be a running theme throughout my talk, and funnily enough, it was in the end Mackenzie's answer to David at COSYNE: by interpretable, I mean decodable.
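To make the setup I just described a bit more concrete, here is a minimal sketch of an LFADS-style model. To be clear, this is my own simplified paraphrase for illustration, not the actual LFADS code, and all of the sizes and details are placeholder assumptions: an encoder RNN infers a distribution over the generator's initial state, a generator RNN rolls the dynamics forward, factors are read out and mapped to firing rates, and the spikes are scored with a Poisson likelihood plus a KL term.

```python
# Minimal LFADS-style sketch (a paraphrase, not the original code; sizes are placeholders).
import torch
import torch.nn as nn

class TinyLFADS(nn.Module):
    def __init__(self, n_neurons=100, enc_dim=64, gen_dim=64, n_factors=8):
        super().__init__()
        self.encoder = nn.GRU(n_neurons, enc_dim, bidirectional=True, batch_first=True)
        self.to_g0 = nn.Linear(2 * enc_dim, 2 * gen_dim)     # mean and log-variance of g0
        self.generator = nn.GRUCell(1, gen_dim)              # autonomous dynamics (dummy input)
        self.to_factors = nn.Linear(gen_dim, n_factors)
        self.to_lograte = nn.Linear(n_factors, n_neurons)

    def forward(self, spikes):                                # spikes: (batch, time, neurons)
        _, h = self.encoder(spikes)                           # h: (2, batch, enc_dim)
        mu, logvar = self.to_g0(torch.cat([h[0], h[1]], -1)).chunk(2, dim=-1)
        g = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterised initial state
        dummy = spikes.new_zeros(spikes.size(0), 1)
        factors, lograte = [], []
        for _ in range(spikes.size(1)):                       # roll the dynamics forward in time
            g = self.generator(dummy, g)
            f = self.to_factors(g)
            factors.append(f)
            lograte.append(self.to_lograte(f))
        lograte = torch.stack(lograte, dim=1)                 # (batch, time, neurons) log rates
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        recon = nn.functional.poisson_nll_loss(lograte, spikes, log_input=True)
        return torch.stack(factors, dim=1), recon + kl        # latent factors, ELBO-style loss
```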
Now of course, decodability is not what most people mean by interpretable, but I do think that decodability is at least a good metric for whether you've got a good hold of the true latents. Because if you have a good hold of the true latents, then they should contain all of the information necessary to actually generate the data that those latents are controlling, in this case behavior. Here they were testing it on a maze reaching task, and you can look at what happens if you train a decoder on the raw data that's just been smoothed, or run through a Gaussian process, or run through LFADS. The answer is that you can do a much better job of decoding the animal's actual reaches if you run it through LFADS. That really suggests you're picking up on the true latents in the system, and it was a very clear advance over the previous systems that existed. And yeah, for me, I was just super excited when I saw this stuff. I was like, okay, boom, so we can start to identify the true underlying latents of a neural circuit. This might actually be a push towards understanding the real dynamics, getting some kind of interpretability if we wanted that, or even just doing downstream useful things like decoding. If you can improve your decoding, you can build a better brain-computer interface. Boom, done, great. I'm happy.

What about other modalities? For me, I was really into calcium imaging because I had done a bunch of it in my PhD, and we also had a collaboration with the Allen Institute. I'll show you; I think calcium imaging is some of the most beautiful data in the world. This is data that I'll describe again later, recorded from primary visual cortex of mice watching some stimuli. We did recordings in the apical dendrites of both infragranular and supragranular cells. It's this beautiful, beautiful, rich data, and I really wanted to be able to take advantage of it. Now, LFADS had been designed for spiking data. We thought, well, how can we modify this to make it appropriate for calcium imaging? The way we approached this is we reasoned that what we were really dealing with was a hierarchical dynamics situation. The core idea behind LFADS is that you have this underlying dynamical system, your computational dynamics; this is what you're interested in. You have some raw data, which in the original LFADS was spikes, and those presumably reflect a fairly simple point process (Poisson generation) based on these dynamics. But in calcium imaging, you have an intervening set of dynamics, which are the calcium dynamics. In addition to the underlying computational dynamics that are occurring, you're observing data that's being mediated through a fairly complicated dynamical system that determines calcium influx and calcium release and all these other things. We reasoned that if we could incorporate appropriate inductive biases into a deep neural network, a modification of LFADS, then we could get it to learn both the latent dynamics and the calcium dynamics that we see in this data. The approach we took for this was what's called a variational ladder autoencoder, hence the name of our system, the cute VaLPACa. The structure of this neural network is as follows. You feed in your raw calcium fluorescence traces. Those first get fed into a network whose job is basically to infer the calcium dynamics themselves; it's like a little mini autoencoder for the calcium dynamics in and of themselves.
It's going to do that variational Bayesian inference to understand those underlying calcium dynamics and recreate them via a Gaussian likelihood. We figured the Gaussian likelihood was appropriate because we're assuming that the experimental noise in fluorescence imaging is Gaussian distributed. But then, and this is why it's called a ladder autoencoder, you also feed some of your inferred activity up to another variational autoencoder. So this is a hierarchy of variational autoencoders, and this second variational autoencoder's job is to infer those computational dynamics. We then tie these together by doing a Poisson likelihood comparison between the rates this model infers should be going on and the underlying rates used to generate our calcium data. That was the structure of VaLPACa. We thought it was rather clever, and it did seem to work.

We did some analysis on synthetic data. We generated data using a Lorenz chaotic attractor, ran that through a model of calcium dynamics, and then added some noise to it to get a fake calcium signal. What we found was that, if we look at our ability to recreate the Lorenz state or to recreate the underlying firing rates that created that data, we were doing better than if we did a calcium deconvolution to get estimates of spikes out and applied LFADS to that, or simply modified the cost function of LFADS to have a different output distribution. We were ultimately able to capture those Lorenz state variables and the firing rates better than these other, more straightforward approaches you might take. That was encouraging. Then Mark Churchland and others, who were all very friendly about this, which was great, shared some data from monkeys performing a reaching task. So we had spiking data, and we created semi-synthetic data by running those spikes through a calcium dynamics model to generate calcium fluorescence traces, and then we examined what happened when we ran our system on those. Previously, it had been shown via LFADS and other systems that you get these beautiful rotational dynamics when animals are doing these reaching tasks. We showed that if you just applied LFADS to the deconvolved fluorescence traces, it didn't seem to work very well, but applying VaLPACa gave you these really nice rotational dynamics again.

Then lastly, again, decoding. We had some data from the Allen Institute where mice were sitting on a wheel watching some stimuli while being recorded in primary visual cortex. They were watching these frames of Gabors, where the animals would see sequences of Gabors that all had roughly the same orientation, and then occasionally the fourth frame in the sequence would have a very different orientation, what we called an unexpected frame. We looked at our ability to decode whether we were looking at an unexpected frame or an expected frame. Here you're looking at features that you get out of either running raw PCA on your data or running VaLPACa: you get traces that are better, ultimately, at identifying the changes that occur for those unexpected frames. This is quantified here in terms of our ability to decode it. If we just decode off the features we get from raw PCA, we don't do as well as if we decode the orientation of that frame from VaLPACa. Okay, this all looked really good. We were excited about it. We wrote up a paper and put it on arXiv.
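As a side note, to give a sense of how that kind of synthetic benchmark can be put together (Lorenz latents, firing rates, Poisson spikes, simple calcium dynamics, noise), here is a rough sketch. The parameter values and the exponential calcium model are placeholder assumptions for illustration, not the exact ones we used.

```python
# Rough sketch of a Lorenz-based synthetic calcium benchmark: Lorenz latents ->
# firing rates -> Poisson spikes -> exponential calcium decay -> noisy fluorescence.
# All parameters are placeholders, not the values used in the paper.
import numpy as np

rng = np.random.default_rng(0)
dt, T, n_neurons = 0.01, 2000, 30

# 1. Integrate the Lorenz system (Euler) to get 3-D latent dynamics.
x = np.array([1.0, 1.0, 1.0])
latents = np.zeros((T, 3))
for t in range(T):
    dx = np.array([10.0 * (x[1] - x[0]),
                   x[0] * (28.0 - x[2]) - x[1],
                   x[0] * x[1] - (8.0 / 3.0) * x[2]])
    x = x + dt * dx
    latents[t] = x

# 2. Project latents to per-neuron log firing rates and draw Poisson spikes.
W = rng.normal(scale=0.5, size=(3, n_neurons))
rates = np.exp(latents @ W * 0.1)                 # (T, n_neurons)
spikes = rng.poisson(rates * dt)

# 3. Convolve spikes with a calcium decay and add Gaussian measurement noise.
tau_ca = 0.5                                      # calcium decay time constant (s)
ca = np.zeros_like(rates)
for t in range(1, T):
    ca[t] = ca[t - 1] * np.exp(-dt / tau_ca) + spikes[t]
fluorescence = ca + rng.normal(scale=0.1, size=ca.shape)
```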
And then, courtesy of the whole review process in science, which I oscillate on, sometimes thinking it's an amazing thing and sometimes a terrible thing, this was maybe one of those instances where it was actually good for scientists. We kept hitting two walls. Reviewers kept asking us for various things that were perfectly reasonable asks. In particular, one of the things they kept asking for was: tell us about your hyperparameters, how they're being selected, how robust this is to hyperparameter selection, et cetera. The honest answer, once we really forced ourselves to explore this more thoroughly, was that it was not robust to hyperparameter selection. It was super sensitive to hyperparameter selection. You would get radically different answers depending on the hyperparameters you selected, and the places where you got good results were tiny, tiny dinky bits of hyperparameter space, which is not good. It means that people aren't going to be able to practically use this system in reality.

The other thing is that we weren't getting good transfer across sessions. Another problem that I think we face, and I'm going to come back to this in the final part of the talk, is that if we really want machine learning tools to carry some of the heavy lifting for us, given our inability to understand how neural systems work, we want to be able to apply these machine learning systems to all individuals. It shouldn't be that I have a different model for every person and every animal; I want a single model for everyone, because I want to put that neural activity into a universal latent space that tells me something that's true about any brain, not just one brain. So we really wanted to apply VaLPACa to many different animals. And we tried, and we tried, and we tried, and it just always got worse when we applied it to multiple animals. Every time we put another animal into the pipeline, with a different tweak to the way we were inputting the data, it still just got worse. In frustration, at some point, after my postdoc had gone off and gotten a job at Graphcore and we were still in this review process, we were just like, okay, suck it. We're not doing this anymore. We're not going to try to get this paper published; it's on arXiv. Part of the discussion was even, should we retract it? But I don't know, it's not quite retraction territory, because it's not that anything we said in the paper was false; it's just that it doesn't articulate the ways in which this isn't quite a satisfying system. So we're just going to leave it there. But we decided to essentially abandon this project.

Now, I should note that Chethan's group has also done some work on extending LFADS to calcium imaging data. RADICaL looks really cool in many ways, but I don't know if you guys maybe found some of the same problems that we found here with sensitivity to hyperparameters and inability to transfer. I think it might unfortunately be a core problem with the variational autoencoder approach to these issues. I think VAEs might be very sensitive to some of this stuff, but I don't know; that's speculation.

Okay. One of the things that I thought could also be a problem here, and this might actually be more of the problem, is an inability to scale. This gets to the transfer issue with machine learning systems. I think the most important lesson in AI over the last 15 years has been what Rich Sutton calls the Bitter Lesson. If you're unfamiliar with it, the Bitter Lesson is as follows.
Over the course of the progression of artificial intelligence, what researchers have repeatedly encountered is that, though it is important to be clever about the way you engineer your systems, and though it is important to have neural networks or machine learning systems that are designed well, at the end of the day any system will only show you what it's truly capable of when you scale up the amount of compute you throw at the problem, scale up the size of the model, and scale up the size of the dataset you feed into it. In AI over the last five years, almost every major advance has actually come from increasing the scale of the models. This is what OpenAI specializes in, right? To be clear, OpenAI did not invent transformers; they didn't invent a new loss function. They did make a couple of architectural tweaks, but what they really have is the knowledge, the know-how, to increase the size of the models, which is very non-trivial, to be clear. I'm not here to trivialize that: actually building a model with a trillion parameters is insane, it's really challenging to do. But if you successfully do it, you get way more bang for your buck out of your ML systems. I think the Bitter Lesson surely applies to the analysis of neural data as well. I am sure that we will get much more out of our data if we can drastically increase the scale of our models and train across all the data out there. So I started asking myself, how could we maybe do this? How could we train on literally every piece of data?

I was thinking about the calcium imaging problem, and I was like, well, okay, the problem is that every animal has different neurons. I record a bunch of neurons in one animal, I record a bunch of neurons in another animal, and who's to say that those neurons relate to each other or have anything to do with each other? There are various approaches for potentially handling this. David, with LFADS, originally said, okay, we're going to do multi-session stuff by feeding different sessions in through different linear encoders to get you into the initial embedding. That's great, but even that assumes that all of the neurons are somehow just a linear transformation of each other, and it's not clear that that's true. Once you switch neurons, you're potentially operating in a space where, as it were, the words being used by the system are different. It's like you've got different systems speaking different languages, and you're trying to feed all of these things into one model, and it doesn't really work.

So I thought, okay, what if we just keep it really native? All calcium imaging has video data; you record a video of the thing. What if we just analyzed the video data? We built this system. It was basically a kooky idea where we took raw calcium videos and fed them through a deep 3D convolutional neural network. Then we used an unsupervised technique called contrastive predictive coding to predict the next step of the calcium videos. Basically, we feed in a set of calcium video chunks; these generate latent representations; those feed into a contextual representation up here, which generates a predicted latent representation that we compare to the actual z that we get from the data at the next time step. Then we do contrastive learning, meaning that we push z-hat and z together, and we repel z-hat from the zs from other time steps.
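To give a rough sense of what that objective looks like, here is a compressed sketch of contrastive predictive coding on video chunks. The architecture and shapes are placeholder assumptions for illustration, not our actual training code, and here the negatives are other clips in the batch rather than other time steps, which is a common simplification: a small 3D convolutional encoder produces per-chunk latents, a GRU summarizes the past into a context, a linear head predicts the next latent, and an InfoNCE-style loss pulls the true next latent toward the prediction while pushing the other clips away.

```python
# Compressed sketch of contrastive predictive coding on video chunks
# (placeholder architecture, not the actual training code).
import torch
import torch.nn as nn

encoder = nn.Sequential(                      # video chunk -> latent z
    nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, 32))
context_rnn = nn.GRU(32, 32, batch_first=True)
predictor = nn.Linear(32, 32)

def cpc_loss(video_chunks):
    # video_chunks: (batch, steps, 1, frames, height, width)
    b, s = video_chunks.shape[:2]
    z = encoder(video_chunks.flatten(0, 1)).view(b, s, -1)   # per-chunk latents
    c, _ = context_rnn(z[:, :-1])                            # context from the past
    z_hat = predictor(c[:, -1])                              # predicted next latent
    z_next = z[:, -1]                                        # actual next latent
    logits = z_hat @ z_next.T                                # similarity to all clips in batch
    targets = torch.arange(b)                                # positive is the matching clip
    return nn.functional.cross_entropy(logits, targets)

loss = cpc_loss(torch.randn(4, 6, 1, 8, 32, 32))
```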
This form of contrastive learning works quite well. Then the idea is that you'd freeze your network, feed your calcium imaging data into it, and decode things: behaviors, diseases, whatever. Let me give you a concrete example. We had some data from a pilot study with Mark Brandon's lab at McGill. They had done a bunch of calcium imaging recordings in CA1 of mice exploring mazes of different shapes; they basically had this modifiable maze, so they could put little blocks into it to create different-shaped mazes. They ran the animals over many, many days, over a month, just putting them into these different-shaped mazes and collecting calcium imaging data in CA1. We trained the neural network I just showed you on this data: we took the raw one-photon videos from CA1, fed them into a big 3D convnet, did contrastive learning on that, and then looked at our ability to decode which of these environments the animal was in. What we were after here was good multi-session training, because, as I said, I think what we really need are models that apply across individuals. I'm done with models that only work for one individual. And it seemed to work. They had recorded five mice for us. If you train this model on just one mouse and look at the confusion matrix, with the ground truth of which of those ten environments the mouse was in against what was predicted, it looks pretty sloppy. But if we trained on four mice, we could then transfer to a new mouse and get pretty good decoding of which environment it was in, fed just off the raw calcium data. I was pretty excited by this. I was like, okay, this is it. We can train across every calcium imaging dataset ever; we're just going to feed the raw videos from all that calcium imaging into this one model. It'll be huge, but it's going to decode all the stuff for us.

But that didn't really work either. I think the principle was good, but we hit the following roadblocks. First of all, it was insanely compute-intensive. Doing 3D convolutions on calcium imaging was just killing our GPUs. We're not OpenAI; we don't have thousands of Nvidia GPUs to throw at this problem. At some point my student was like, yeah, this is cool, Blake, but I'm tired of wasting my life watching the GPUs run. I was like, okay, that's fair. Then, of course, the other issue is that it's not necessarily going to work for other modalities. Maybe it would work for other imaging modalities; you could even think about doing a version of it with things like Neuropixels, where you treat each electrode as a pixel and work with it as an image. But again, it's going to be super compute-intensive, and whether it would work well with that kind of data is unclear. So we gave up on that as well.

Let me now bring you to a bit of hopefulness. After years of being frustrated by these various endeavors, I think I've finally come to a more general approach, and it's all about being careful about how you design the tokens you feed into a transformer. Let's talk about that.
This shift in mentality came about for me because, okay, to reiterate our goal here again, what I really want is to be able to train across many different datasets containing many different subjects. If I've got a model of one monkey doing a reaching task and I can say something about the neural dynamics going on there, that's all very well and good, but what I'd really like is to not just train on this one individual; I want to be able to zoom out to every possible individual. I want to train on monkeys doing different reaching tasks, maybe non-reaching tasks; I want to train on datasets from every lab in the world. Because I want scale. GPT was trained on trillions of words; I need trillions of spikes. I'm not interested in your dataset with five monkeys. I want every dataset that's ever been collected in neuroscience. How am I possibly going to do that? Well, the answer, I think, lies in resolving this problem. As I already articulated, the problem we face is that the neurons we record in any two sessions are not identical, nor are they necessarily a linear combination of some underlying features. You're dropping electrodes into a system with hundreds of millions or billions of neurons, depending on the species you work on. The chance that you just happened to sample a similar set of neurons in two individuals is infinitesimally small. You have to accept the fact that you've got different neurons in every recording. And if you have a vanilla sequence model where you're feeding each neuron in as a separate channel to your model, which is the traditional approach, where we treat each neuron as a separate dimension in our input, a separate channel, then when I give you a different set of neurons from a different individual, which channels do these go to? How am I going to figure that out? Should there even be a matching channel for them to go to? What if they're totally different channels? It seems a very intractable problem. This was why I had wanted to move to raw imaging: you don't have to worry about it if you just feed in the raw image and let the ML system discover it. But as I said, that doesn't scale well.

I spent some time feeling frustrated with this problem, and then the answer came about courtesy of my colleagues here at Georgia Tech, Eva Dyer and Mehdi Azabou. Their solution is called POYO. When I first saw this acronym, I was like, guys, come on. But I've actually really come to love it, because it really does describe the model well. If you don't know, POYO is the sound that Kirby makes when he eats things, and as I've articulated to you, our philosophy here is that we want to eat all the data. This model, POYO, is a really clever solution to that problem of different neurons. Let's talk about it now. The core idea is to tokenize neural data in the way that we tokenize language in transformer models in machine learning. What you do is you take your word, which is word number whatever out of your lexicon, and run it through a very simple embedding system to transform that word into a vector that a neural network can work with. That vector becomes the token that you do your sequence modeling on. And in language, of course, there are some reasonable and obvious ways to tokenize things; you can do some very simple clustering on sequences of characters.
There are certain sequences of characters that are always going to pop up and that are consistent across the text you receive, no matter what the text is. But this is the core idea: each of your words becomes a vector in this space, and you do sequential modeling over these vectors. What Eva and Mehdi said is, well, what if we could do something similar with neurons? What if we had a way of embedding units into a space like this and then working with that? In principle that sounds good, but in practice the issue is the one I just described, which is that the embedding is somewhat obvious for words. Across text, the word dog is going to be the same word, dog; roughly, maybe sometimes the underlying meaning shifts, but it's not unreasonable to treat the word dog as the same token in every piece of text you train your neural network on. It is not reasonable, however, to say that whatever happened to be unit 911 in this recording is the same across all of your recordings. That would be stupid. So how do you deal with this? This is the tokenization of text: we receive our sentence and chunk it up into these vectors that go into our large language model. Instead, what Eva and Mehdi said is: let's not tokenize the neurons, let's tokenize individual spikes, and let's do so in a way that lets the transformer learn about the relationships between different neurons in different datasets, such that they can be radically different neurons but maybe have some interdependent relationship between them.

So they tokenize spikes rather than neurons. They take a variable that indicates literally just the ID of the unit paired with the recording ID; you basically just say, this was neuron 1029 in recording 516. You also add to that a timestamp, and you run this information through a learnable embedding. In this way, every spike of every neuron can become its own custom word for that dataset; it's like you can learn a separate lexicon for every dataset. When I saw this, I was like: boom, okay, I get it. Yes, that's going to work. And it does work. Here's an illustration of the process. Every spike receives a token, and note that now it doesn't matter if you've got different neurons in different datasets, because I don't need a set of channels that have to be the same across datasets. For every spike I receive, I just generate a new token: this is a spike from this unit, in this recording, and it occurred at this time. That's it. Then you feed a sequence of tokens into your neural network. The architecture looks like this. Initially, we take those timestamps and unit IDs and run them through a learnable embedding system that uses a cross-attention mechanism, where there's a series of learned latents for that embedding, and those get compared via cross-attention with your spike tokens. That comparison is what allows the system to learn a flexible language for treating spikes from different neurons in different recordings as being somehow related to one another, even if they're not identical to each other or a linear combination of each other. Because cross-attention is a very flexible mechanism, it allows the system to compare across all the spikes: okay, for all these spikes, which ones should be treated as similar or different? And from that, it learns this flexible lexicon.
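To make that a bit more concrete, here is a toy sketch of what a spike-tokenization plus cross-attention compression step could look like. This is my paraphrase for illustration rather than the actual POYO implementation, and the dimensions, the way unit IDs are indexed, and the time encoding are all placeholder assumptions.

```python
# Toy sketch of the spike-tokenization idea (an illustrative paraphrase, not the
# actual POYO code): each spike is (unit ID within its session, spike time); unit
# IDs index a learnable embedding, times become a learned feature, and a
# cross-attention layer with a small set of learned latents turns the
# variable-length spike sequence into a fixed-size representation.
import torch
import torch.nn as nn

class SpikeTokenizer(nn.Module):
    def __init__(self, n_units_total=10_000, d_model=128, n_latents=32):
        super().__init__()
        # in practice each (session, unit) pair would get its own row in this table
        self.unit_embed = nn.Embedding(n_units_total, d_model)
        self.time_proj = nn.Linear(1, d_model)
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, unit_ids, spike_times):
        # unit_ids: (batch, n_spikes) long, spike_times: (batch, n_spikes) float
        tokens = self.unit_embed(unit_ids) + self.time_proj(spike_times.unsqueeze(-1))
        queries = self.latents.expand(unit_ids.size(0), -1, -1)
        compressed, _ = self.cross_attn(queries, tokens, tokens)  # latents attend over spikes
        return compressed                                         # (batch, n_latents, d_model)

tok = SpikeTokenizer()
out = tok(torch.randint(0, 10_000, (2, 500)), torch.rand(2, 500))
```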
After running through the cross-attention, you then have a series of self-attention layers, per a standard transformer model. In this case, what Eva and Mehdi were interested in was direct behavioral decoding; we're not doing unsupervised latent space recovery here, just direct behavioral decoding. So they also feed in some query tokens requesting information about a particular behavior. This, again, is important for being flexible across recording sessions, because that way you can have a different behavior in every recording. You run that through one more cross-attention mechanism before finally outputting the predicted behavior you requested. In an initial set of tests, we examined this across a range of reaching tasks. First, I should say, I forgot I included this: even if you just look at a single session, POYO often does really well. Here, this was just training on a maze reaching session, and if you compare to a lot of the other techniques for extracting latent representations or doing direct decoding, POYO is ultimately better at extracting the behaviors that are occurring in this system.

But for me, the even more exciting thing is that POYO works with multiple sessions. As I said, you can treat each of these behaviors as a different set of queries and each of the recording sessions as a different set of tokens, batch these all up together, and feed them into your transformer. And it actually works. Here we're looking at decoding in single sessions versus multiple sessions for a series of different datasets, collected by different labs as well; we're not even in a single-lab situation. These are different datasets collected across different labs, with different monkeys doing different reaching tasks. For me, the key thing is that in each of these plots, the multi-dataset model does better than the single-session model. And that's exciting, because it means scaling is working; it means we can take advantage of the Bitter Lesson. If it gets better when I add more monkeys in, even if they're doing different tasks, even if it's from a different lab, even in some cases a different brain region, that means I can now train on all the data. I can just hoover it all up, and my model should keep getting better and better and better. And we can pull an OpenAI, just go massive on scale, and eventually get all the stuff we want.

This is illustrated here also with respect to the scale of the model. Here Mehdi was exploring what happens when you increase the scale of your model, in terms of the number of spikes you're feeding into your training set and the number of layers your model has. We're looking, again, at the R-squared with the actual behavior, so this is our ability to decode behavior. This is what we get from a single-session model on this particular reaching task. Then we add all of the different random-target reaching tasks, and you can see that moving from a single session to all of the random targets, you get this very clear increase in your ability to decode. Now we add a set of center-out reaching tasks: boom, again, an increase in your ability to decode. Here you can see you're increasing the number of spikes, and theoretically this curve could keep going, right? We could keep going out to a billion spikes, a trillion spikes, and that R-squared should keep going up. In particular, it should keep going up if we increase the size of our model.
So that's the other key thing: to really take advantage of that data scale, you need big models, and that's again a key thing we saw with POYO, that when you increase the number of layers, you also get these additional benefits. Okay, so it looks like we're really in a good place for taking the Bitter Lesson to heart.

Here's a fun little example: transfer across species. Mehdi and a guy who works with us at Mila trained a model on all that monkey data, decoding monkeys doing the reaching tasks, and then applied it to a dataset from Willett et al. of people who had Utah arrays in their motor cortex and were asked to imagine writing characters. The task is to decode which character they were imagining. If you train a model of equivalent architecture on just a single session from a human, these are the sorts of error rates you get. Not bad, okay? You can get an error rate of around 15 to 20 percent. But if you pre-train POYO first on all the monkey data, you crunch that error down. And that's super exciting, because it means not only can we train on all the data from a given species, but from all species. Now I can train on all the mouse data and all the monkey data, and use it in human beings. For me, that was super exciting.

As an aside, we ended up deciding to apply this to calcium imaging data as well. We came back to this dataset, and this is a fun example of how it can be beneficial. This is really hot-off-the-press data, and the next couple of plots are going to be shambolic aesthetically. Anyway, we had this data where we had done recordings in both the dendrites and the somata of mice observing these sequences of Gabor frames. One of the things we were interested in is whether we could do multi-session training across dendrites and somata. In principle, that sounds a little bit crazy. Could you train a single model on data from both dendrites and somata? Why the hell should that work? But I think POYO makes it possible. The POYO philosophy we adopted here was to take note of the fact that dendrites versus somata have very different shapes to their ROIs: dendrites have long ROIs with relatively little area, whereas somata have nice, round, large-area ROIs. So you provide this information to POYO. You don't just tell it that an event occurred at this time, from this neuron, in this recording; you also say that it came from an ROI with this width, height, and area. So we prepare a set of additional features to feed into POYO: the height, width, and area, and the position within the imaging frame. Now our input to the POYO embedding scheme becomes the spike-type variable, the value embedding (the size of the calcium transient), and then our ROI positions and ROI features, all concatenated together, with the unit ID embedding added on top. That's what then gets run through the cross-attention mechanism to develop your embeddings for the transformer. If we do that, we actually get good transfer across somatic and dendritic sessions.

This is a funny plot, but I'll break it down very briefly for you. One of the things we found with this data is that an MLP by itself could actually do pretty well on a single session; it was a pretty decent model. We often had trouble with VaLPACa, for example, getting decoding of the orientations that could beat just an MLP trained on a single session.
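To make that token construction concrete before I show the comparison, here is a rough sketch of how such a calcium-event token could be assembled. The field layout, names, and sizes are illustrative assumptions, not our exact scheme.

```python
# Rough sketch of assembling a calcium-event token (illustrative assumptions):
# concatenate the event amplitude, the ROI's width/height/area and position,
# and the event time, project to the model dimension, and add a learned
# per-unit embedding before the cross-attention stage. A spike-type flag
# could be concatenated in the same way.
import torch
import torch.nn as nn

class CalciumEventTokenizer(nn.Module):
    def __init__(self, n_units_total=5_000, d_model=128):
        super().__init__()
        self.unit_embed = nn.Embedding(n_units_total, d_model)
        # 7 features: amplitude, width, height, area, x, y, time
        self.feature_proj = nn.Linear(7, d_model)

    def forward(self, unit_ids, amplitude, roi_feats, xy, times):
        # unit_ids: (n_events,), amplitude/times: (n_events,),
        # roi_feats: (n_events, 3) width/height/area, xy: (n_events, 2)
        feats = torch.cat([amplitude.unsqueeze(-1), roi_feats, xy,
                           times.unsqueeze(-1)], dim=-1)
        return self.feature_proj(feats) + self.unit_embed(unit_ids)

tok = CalciumEventTokenizer()
tokens = tok(torch.randint(0, 5_000, (200,)), torch.rand(200),
             torch.rand(200, 3), torch.rand(200, 2), torch.rand(200))
# tokens: (200, 128), ready for the cross-attention embedding stage.
```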
We had thought that if we could do multi-session training we'd do better, but we kept struggling with that. What we find is that with POYO, with these additional ROI features added in, we can now get reasonably well past that MLP level. The other thing is that this is not sensitive to hyperparameters. We've basically done no hyperparameter tuning here; the hyperparameters are just taken from the initial ones we had for spiking data, and it just seems to work out of the box. I think it's because the transformer is able to learn this flexible lexicon for different neural data. It's a really powerful approach, I think, for doing multi-session work now. It means we could train across dendrites, across brain regions, across species, across tasks. Boom, the Bitter Lesson can finally be taken into account.

That brings me to the end goal, and I know I need to finish up. Where I ultimately want to see us move towards is a foundation model for neuroscience. There's a lot of data out there; I just want to emphasize that for everyone. If we consider all the different modalities, it's just insane how many hours of neural data have been recorded and freely released, almost none of which is getting used in a single model. The challenge is the fact that each brain is unique; they're not speaking the same language. We need a universal translator for brains that can take account of the ways in which each neuron is different, each behavior is different, each stimulus is different, and each recording modality and species is different. I think that if we had a universal translator for brains, it would allow us to finally make sense of all this diverse data, because it would take that neural activity data and put it into a universal latent space that we could then use for decoding, diagnosis, treatment, prediction, control of neural activity, et cetera. And you wouldn't have to worry about the fact that you don't have enough data for any one individual, because that's the key problem. Theoretically, this could all work on a single individual if I could record hundreds of thousands of hours from them: I give you a Utah array, and now I want you to spend the next ten years of your life just giving me data to train my ML model. That's infeasible. What is feasible is a system that is trained on all of that other data out there first, which I can then fine-tune rapidly with, say, an hour or two of data from a single human and get good performance. What I'm really articulating here is the vision of a neuro foundation model. In AI research, a foundation model refers to a large neural network that's been pre-trained on huge amounts of diverse data and then fine-tuned for downstream tasks. I think that's really what we want to do here: feed all of the different types of data into a single model, pre-train it in some way, possibly in a self-supervised manner, and then have it available for fine-tuning for these various applications: drug efficacy prediction, disease risk prediction, brain-computer interfaces, whatever. Okay, to finish up, I'll say the hope is that if we did that, we could then, after pre-training the model, release it so that any group around the world could use it. My ultimate dream is that we collectively, as neuroscientists, will produce this large pre-trained model that can be used by everyone else for their clinical or research purposes.
We can now peer into brains with ever more sophisticated technology. It's incredible what we can do. Young me would have been so impressed with the recording technologies we have, but I don't feel like we're using the data sufficiently yet, and I really think this is the direction we need to go in to do so. So that's that. Thank you very much for listening. Thank you to all my collaborators and students, and of course my sugar daddies for funding me. And thank you, Chethan, for inviting me and having me here at Georgia Tech.

It seems like over the course of the talk there's a pivot, from interpretability being something to strive for, which methods like LFADS build in, to let's just build the biggest thing possible and get the biggest R-squared in decoding. So is interpretability just: forget about it?

No, no, you very accurately observed that trajectory, because that has been my personal trajectory over the last six years. I will admit, possibly to the chagrin of some people here, I don't know, I've given up on interpretability. As I said in the intro, I'm not convinced that interpretability is possible for the brain; that's basically what I've come to. And I think there are some interesting questions there that we can discuss.

Because in some of those tasks LFADS does pretty well, almost as good as POYO, but arguably has interpretable latents. So we sacrifice it all for like 4% R-squared?

Right, absolutely. So the question is whether all neural circuits are going to be nicely interpretable, like some of those neural circuits we've applied LFADS to successfully. I think the challenge is that a lot of the work, and this was the case for our paper on VaLPACa as well, takes a system where you already know the end answer you want and then shows you can derive it from the data. Right, great. But what if you don't know what that end answer should look like? What does it mean to be interpretable? I think there are some real challenges there. Now, Chethan and Nick are thinking hard about this, and I think they've got the right approach for starting to ask this question and whether it's even possible. For the record, guys, I hope your competition proves me wrong, that in fact interpretability is possible; I'd really like to see that happen. And maybe it is. But for the record, this is just an intuition. I very much have pivoted to: you know what, forget about interpretability, I want that boost of 4% on my R-squared, because at least then I can have something that might help people in the real world. That's my goal to some extent. Sorry, I saw a very quick hand shoot up out there, but Nick, I'm sure, has a burning answer to this.

So you're on the side of the underpants gnomes, then?

Well, almost. I am on the side of the underpants gnomes that have been hired at OpenAI, which is to say I have, yes, drunk the Kool-Aid, and now believe that the answer can be as simple as: you need to pick a good tokenization regime, and you need some reasonable model designs, but then scale should be able to bring you most of the way. You got it. Sorry, yeah, you had a question.

You motivated your talk with all the amazing implications that are possible with this kind of approach, and from that perspective, from getting practical outcomes, it makes a lot of sense. Same thing for language models: they can do things.
But neuroscientists presumably also have a basic science goal of distilling the fundamental principles of how these systems work, which is directly related to interpretability. Are those separate for you? Do you just not care about the basic science quest, the theory of everything in the brain?

My answer to that is that if you had asked me that question four years ago, I would have said, oh, absolutely I care about the basic science quest; that is what I care most about, and all this other stuff is fluff. But I suppose I have been changing philosophically, and the reason is twofold. One is my fear that we might just be up shit's creek in neuroscience, and that maybe we're not going to be able to get a fully easy-to-understand description of how the brain works. The other is the realization that many of the technologies that have the biggest impact on people's day-to-day lives were built without the basic science understanding. This came about for me because I heard interview after interview after interview with neuroscientists talking about brain-computer interfaces or diagnoses or drug treatment, et cetera, and they would say to the journalist: the problem is we don't understand how the brain works; first we have to understand how the brain works. At first I was always going along with that, like, yeah, you're right, you're right. And then I started looking at the history of other sciences: chemistry, physics, optics. People would do stuff without really knowing why it worked, but it worked, and it had huge impacts on the world. This is a very cynical take, but here it goes: if we want that tap to stay open for neuroscience (no, I'm serious), if we want to keep receiving those hundreds of billions of dollars of research money to get to that basic science understanding, it's time to start delivering some stuff that actually helps people. If we don't, we're not going to keep getting the funding; guaranteed, politicians are going to pull it back. So even though some part of me really likes the vision, and I want to see that basic science understanding, and I hope I'm wrong about interpretability, I really do, I think we still need to get on with this job right now, because we need to start producing some concrete results for the funders.

As you're trying to put in more and more data types, does your token vector just get longer and longer as you keep adding more modality-specific features? How do you envision that?

Well, you don't necessarily have to keep increasing the length of your token sequences as you add more modalities, because you can view them as almost additional text that you're feeding in. You can start to take in more modalities and feed those in as if they were additional text to the model, and keep your context window the same.

The only thing I'd ask about with that, though, is the dimensionality of the token itself.

Oh, the dimensionality of the token itself. No, I don't think that's going to be a big bottleneck, depending upon the tokens you design. And this is where token design is going to be the key; this is going to be the clever engineering question over the next few years. How are we going to design our tokens? If I take fMRI data or EEG data, what should the token look like? You want it to be a token that's somehow commensurate with a spike of some sort.
That's how you want to start to think about it, and maybe it's going to have to take a different form. But I think the core principle is there. Maybe all you need is to say where the electrode was on the skull, give it a breakdown of some of the frequency composition at that moment in time, and then some other information, like that it was this subject and this task, et cetera, and you feed that into tokens of probably the same dimensionality. I suspect it would work, because there's probably enough dimensionality in one of these tokens to handle it; a few hundred dimensions is enough for that level of information, would be my guess. But we'll see.

I think we're done. Okay, everybody, let's thank our speaker. Yeah.