Thank you. Hi everyone, and many thanks to David. I hope you can all hear me. OK.

Hello everyone, and many thanks again for the opportunity to visit Atlanta for the first time, and Georgia Tech, also for the first time. I am having a great time visiting your university and your department, the CSE department, which is a very unique department; you should all be very proud. I will tell you about the work that we do, which hopefully will inspire a lot of ideas in all of you. I will tell you how, looking at DNA microarray data, we were able to predict mechanisms computationally from the data, and how we experimentally verified these mechanisms: we were able to predict, in this instance, correctly.

It occurs to me that I actually don't have a pointer. It's OK, I'll use my hands. So, where I come from: I did quantum information for a living for my Ph.D., and now I'm looking at biological information. In this talk I will mainly talk about data from DNA microarrays, but you want to keep in mind that these same ideas are very much applicable to data from many high-throughput technologies in biology, and to many other types of data. DNA microarrays record genomic signals, which is why I call my lab the Genomic Signal Processing Lab.

Specifically, think about the flow of biological information in the cell. It goes from DNA: the DNA replicates and is transcribed into RNA, and the RNA is translated into proteins. DNA microarrays are able to measure the abundance levels of all genes in a given genome, say the human genome, or the genomes of other model organisms. They are able to measure the DNA
copy number, or the abundance level of the DNA pieces that relate to each of the genes. They are able to measure the RNA abundance levels, or RNA expression, in the full genome, and also all the protein binding to DNA or the protein binding to RNA. So for pretty much all of the molecules that carry genomic information, it is possible to measure genomic-scale abundance levels in the cell using DNA microarrays, except for proteins; and there are protein arrays and, I guess, mass spec and other technologies that are coming up to fill in that one gap, the lack of global information about proteins.

This is the first time in history that we have this information that the cells generate and also read as the cells grow, as they undergo biological processes, as the cells make the decision, say, to undergo the transition from normal tissue to tumor. It is the first time that we have this information recorded, that we are able to see it. And already biologists, by finding patterns in this kind of data, are able to get great insights: by classifying genes into groups of genes and then inferring a particular function of a gene from all the other genes that it was classified with, or by classifying samples into groups of samples and identifying, for example, tumor samples. They are starting to be able to make great advances in basic science and in medicine and [inaudible].

What I want to tell you today is that the patterns we find in the data can be used beyond the classification of genes and samples, to actually discover principles of nature. There are multiple examples from the history of science, and one of them is that of Kepler, who looked at data, which you see here tabulated in a matrix, in some ways similar to many other forms of data: data of positions of different planets along the x
axis, at different times along the y axis. Kepler found patterns, and these patterns corresponded to the laws of planetary motion; here we are showing an ellipse that corresponds to the first law of planetary motion. So I am hoping to show you that it is possible to do the same thing looking at molecular biological data: to find patterns and to associate these patterns with principles of nature, or previously unknown mechanisms. You might say, well, most of the patterns we find in this kind of data are patterns of correlation. I just want to mention, I'm not showing it here, but Kepler's third law of planetary motion is actually a law of correlation, and it is the one that gave rise to the universal law of gravitation. So even patterns of correlation could correspond very well to mechanisms.

The mathematics that we use to find these patterns is very different from the math that Kepler used, and it is still physics-inspired: we are using matrix and tensor models. What do I mean by models here, when we are looking only at data, without equations? The models are frameworks for describing the data, where the variables, the patterns that we find in the data, and the operations with these patterns, such as the classification that we mentioned, data reconstruction, for example, or projection, all actually represent some kind of biological reality.

So, for example, we use the singular value decomposition in the sense that it is a generalization of the eigenvalue decomposition in physics. In physics, the eigenvalue decomposition gives you eigenmodes, and we already know that eigenmodes have physical meaning. For example, if you look at the eigenmodes of a bunch of interacting oscillators, one of them will describe the motion of the center of mass of the system. Similarly, we generalize from, I guess, the square symmetric matrices that people look at in physics to the SVD and to the rectangular matrices in which the data are organized in our case, where we have along the y axis all the genes in the yeast genome, and along the x axis different time points in the yeast cell division time course. The data are described here in the raster display: in red we have overabundance levels and in green we have reduced abundance levels relative to the steady state of RNA expression in yeast.

So we have this matrix, and now we find eigenmodes, if you want: right eigenvectors, which we call eigengenes, since they look like genes, and left eigenvectors that look like arrays, which we call eigenarrays. We find these two types of eigenmodes, and we are able to show that they have biological meaning. They capture the pattern of activity of the cellular elements that generate the observed signal, in this case regulators of the cell division cycle, and this pattern corresponded to cellular states in which these regulators are overactive. We could show that they very strongly correlate with patterns measured experimentally from cells in which these regulators were actually experimentally overactive.

We then generalized that, looking at the generalized SVD. Again we think about it as a generalization of the eigenvalue decomposition, in this case of the generalized eigenvalue decomposition, which in physics is used to compare the distribution of energy among the different eigenmodes between kinetic and potential energy. So we are comparing two matrices. At the same time that we did this work, I should mention, Haesun also used the generalized SVD for comparing text data.
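The matrix SVD described above, with its eigengenes and eigenarrays, can be sketched in a few lines of NumPy. This is an illustrative sketch, not the speaker's code: the data are synthetic, the function name `svd_eigengenes` is hypothetical, and the "fractions" are simply the squared singular values normalized to sum to one, assumed here as a measure of how significant each pattern is.

```python
import numpy as np


def svd_eigengenes(expression):
    """Decompose a genes x arrays matrix as eigenarrays @ diag(s) @ eigengenes."""
    eigenarrays, singular_values, eigengenes = np.linalg.svd(
        expression, full_matrices=False
    )
    # Fraction of the overall signal captured by each eigengene/eigenarray pair.
    fractions = singular_values**2 / np.sum(singular_values**2)
    return eigenarrays, singular_values, eigengenes, fractions


# Synthetic stand-in for a cell cycle time course (assumed, not real data):
# every gene oscillates once over the time course with its own phase.
rng = np.random.default_rng(0)
n_genes, n_arrays = 200, 12
time = np.arange(n_arrays)
phase = rng.uniform(0.0, 2.0 * np.pi, size=n_genes)
amplitude = rng.uniform(0.5, 2.0, size=n_genes)
data = amplitude[:, None] * np.cos(
    2.0 * np.pi * time[None, :] / n_arrays - phase[:, None]
)
data += 0.05 * rng.standard_normal(data.shape)

eigenarrays, singular_values, eigengenes, fractions = svd_eigengenes(data)

# Rows of `eigengenes` are patterns across arrays (time points);
# columns of `eigenarrays` are patterns across genes.
print("top fractions:", np.round(fractions[:3], 3))
```

In this toy setting two sinusoidal eigengenes should carry nearly all of the signal, which mirrors the idea that a few eigenmodes can capture the periodic cell cycle behavior.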
So the thought to view the GSVD as a comparative framework for two matrices really arose, I guess, at different places at about the same time; I guess similar-minded people think alike. Here we compared yeast cell division cycle data and human cell division cycle data. The idea of the comparison is that there is a one-to-one mapping between the time points, but the genes in yeast we don't need to map onto the genes of the human. The GSVD gives us the simultaneous diagonalization of the two matrices. Now we find the same set of patterns representing the activities of genes in yeast and human, but some of the patterns are exclusive to the human dataset, some to the yeast dataset, and some are common to both datasets. If we try to interpret these patterns, we find that the exclusive ones represent exclusive processes, namely the response to synchronization in each organism separately; the synchronization procedures are experimentally different in human and in yeast. The common patterns represent the common cell cycle process, the periodic RNA expression behavior during the cell division cycle in human and in yeast.

And now, the first mechanism we predicted, by integrating two types of data. I mentioned that with DNA microarrays you could measure different types of data, for example proteins binding to DNA, and RNA expression. Essentially we are doing this by a generalization of the pseudoinverse projection, which in physics is used for a transformation of coordinates, or for moving from one frame of reference to another. Here you can think about it as moving from the frame of reference of the binding of proteins to DNA
to the frame of reference of the possibly subsequent RNA expression. I say possibly because, well, we know that there are some causal correlations, but we don't know that there is causality in everything in the data.

So let me tell you how you predict a mechanism. We are looking at these data: again we have here all the genes, or open reading frames, in yeast, and here we have particular proteins. Some of these proteins are transcription factors that are known to drive cell division cycle RNA expression, and some of these proteins are replication initiation proteins. Those two datasets were measured in Rick Young's lab; they were published in two separate publications, and they were not analyzed together by anybody before us. The reason we thought we should put them together is that replication is an important process during cell division: the cells replicate their DNA content and then divide, each into two cells with identical DNA material. So we thought, well, replication is part of the cell cycle; let's see how the binding of these proteins to DNA might affect possibly subsequent expression. Again, we did it by using the pseudoinverse projection. As I told you, one way of thinking about it is as a transformation of coordinates from the binding of proteins to DNA to that of RNA expression. The other way of thinking about it is to ask which patterns of binding of proteins to the DNA can describe the patterns of RNA expression. So again we can think about it in terms of patterns, and we find correlations.

At the end of the day, what do we find? For example, we find that these particular proteins that are here in green, Mbp1, Swi4 and Swi6, are projected onto the cell cycle stage G1, where the time actually comes from the RNA time course. What the mathematics tells us is that these proteins bind adjacent to genes that show peak expression during the cell cycle stage G1. We know biologically that this is true, and we also know biologically that there is a mechanism underlying this correlation: these proteins bind near the genes that they are going to activate, and make that cell cycle stage happen.

Similarly, we look at the replication initiation proteins. Just to remind everybody, replication in eukaryotes doesn't happen sequentially from one end of the chromosome to the other. It happens in parallel, from multiple spots along the chromosomes where these particular proteins bind, and replication initiates at all these multiple spots along the chromosome at similar times. So when do these proteins bind? Well, according to our mathematics they are anticorrelated with G1: they bind adjacent to genes that show reduced expression during this cell cycle stage, G1. We asked our biology friends, does that seem to make sense? It turns out that there was actually no known correlation between the binding of these replication initiation proteins and RNA expression, but the correlation we found made sense. We talked with John Diffley, who is now the deputy director of the London Institute of Cancer Research UK, and whom I met at the Nature Biotechnology Miami Winter Symposium in 2004; it was on the cell cycle. We went there and talked with experts, and John is an expert in DNA replication initiation in yeast. In his lab they did some experiments that biologists consider classic, where they showed that these proteins bind to DNA
to initiate replication at the same cell cycle stage, G1, and they are also known to be involved with suppressing transcription at two locations in the yeast genome. So what that means is that this reduced expression of the genes that these proteins bind adjacent to happens at the same time that they actually bind. And because there is a mechanism underlying the correlation that we already know of, we said, well, this not only makes sense; we are feeling pretty confident that there is a mechanism underlying this correlation too. So we predict a mechanism.

What is the mechanism we predict? Again, it's a correlation, so we don't know what causally affects what. It could be that transcription regulates replication, so that if we have a gene that is being transcribed, then it is difficult for an origin to bind the protein at that particular spot. This was shown for a single origin of replication on a plasmid in '88. After we made our prediction, and actually inspired by our prediction, Bik Tye of Cornell, who is also an expert in replication initiation, expanded these results essentially genome-wide: she expanded it from a single origin to the whole genome in yeast in 2006. It could also be that replication regulates transcription: it could be that if we have binding of a protein to the DNA at a particular point, a nearby gene has difficulty attracting the machinery for transcription. In the experiments that I will show you, which we did later, this is what we essentially verified.

But at that point we were feeling excited, because this was the first time that a mathematical model of microarray data actually predicted a mechanism. As I said, so far people had concentrated on classification of genes and samples, even if very successfully. This is a prediction of a mechanism.

At that time we also started thinking about tensors. [inaudible]
It seems many people started to think about tensors at that time. Why did we think about tensors? Because it occurred to us that the context in which data are measured actually matters. For example, think about networks, which you describe in matrices. Say, for a network of the London Underground, you have all the stations along the x axis and the same stations along the y axis, and inside the matrix you tabulate, say, the number of trains that go between the stations. A single matrix essentially describes the London Underground, but in black and white, because the context, the train line on which we go from one station to another, disappears. When you are in London you actually want to have that map in colors: you want to know if you go from station one to station two on the Piccadilly line or on the Circle line. In that case you can imagine having a single matrix describing each line of this network, and once you start stacking these matrices one behind the other, you get an additional dimension, and your two-dimensional data become a three-dimensional cube of data.

I won't go into much of the details; you are welcome to ask about it later. We actually formulated a very simple higher-order eigenvalue decomposition to look at this kind of data, and then we also turned our attention to the higher-order singular value decomposition. Why do we want to do a higher-order singular value decomposition? Earlier I showed you that we find patterns, say, in a time course: here we have time points, here we have all the genes, again in yeast, in a cell cycle time course. We were able to find patterns along the genes and patterns along time, and we had a square matrix in the middle telling us to what extent these patterns are significant in the data. I already mentioned that these eigenmodes seem to have biological meaning and were helpful in interpreting the data.
If you want to see how the cell division cycle is affected by different external stimuli, you could have, for example, some environmental chemical sitting in the environment of the cell culture. It could be an oxidative stress material; even though it's yeast, this is a model system for cancer, and the experimentalists actually did it with the idea of studying it as a model system for cancer. So you have some control time course, then you subject the cells to some chemical, menadione, and you subject the cells again to another chemical, hydrogen peroxide. Here we have only one additional dimension to our two-dimensional data of genes over the course of time, the dimension of conditions, but you can easily imagine that the conditions could be multiple additional dimensions: different concentrations, different times of application, different types of chemicals grouped by their known activity, et cetera. But this is just a proof of principle.

Why do we need a proof of principle? Because when you go from a matrix SVD to a higher-order one, to three dimensions, and higher than that, as we did, you cannot preserve all of the mathematical properties that were so nice for the matrix SVD. Earlier, with the SVD, we had orthogonality of the patterns that we find and diagonality of the core matrix. When we go to higher orders, we can either preserve the orthogonality of the patterns along all the directions, but then we get a full core tensor; or we forgo the orthogonality, and then we get a diagonal core tensor. There are additional variations on that, but we cannot have both orthogonality and a diagonal core tensor at the same time. Why does it matter?
Well, we already saw that the orthogonal patterns are useful for us, and there are some reasons why we think that they are useful in general. So we wanted to stick with orthogonal patterns. It is also nice that this is an exact decomposition. But to interpret the patterns, it helps if we know which are the significant patterns in the data, and when you look at this structure, it is a little less clear how to figure out which are the significant patterns in the data. This formulation of the higher-order SVD essentially translates mathematically to the question of what is the rank of a tensor, if you like, which was discussed in papers by Tammy Kolda and others; this is the decomposition that was formulated in modern times by Lieven De Lathauwer. It doesn't tell us which patterns to interpret, what the significance of these patterns is, what the entropy of the tensor is, all these questions that we ask when we actually apply these decompositions to data, or what to do if we have a degenerate subspace.

But then it occurred to us that you can think about the decomposition, instead of as a decomposition into patterns along genes, time and conditions, as a decomposition into combinations of these patterns: a single pattern along the genes, a single pattern along time and a single pattern along the conditions together give a rank-one subtensor, and rank one is well defined for tensors. The cube of data now becomes a superposition, with coefficients that are taken from that full core tensor, of these rank-one subtensors, these rank-one cubes of data. Now we cannot get the significance of the individual vectors, but for the subtensors we can formulate the significance in a way really similar to the way we do with the SVD. We can define some entropy for that.
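The reformulation above, a data cube written as a superposition of rank-one subtensors weighted by the entries of a full core tensor, can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the speaker's implementation: the cube is random synthetic data, the helper names `unfold` and `hosvd` are hypothetical, and the significance fractions (squared core entries normalized to sum to one) and the normalized entropy are assumed definitions chosen by analogy with the matrix SVD.

```python
import numpy as np


def unfold(tensor, mode):
    """Matricize: rows index `mode`, columns index all remaining modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def hosvd(tensor):
    """Return (core, factors) with orthonormal factor columns in every mode."""
    factors = [
        np.linalg.svd(unfold(tensor, mode), full_matrices=False)[0]
        for mode in range(tensor.ndim)
    ]
    core = tensor
    for mode, factor in enumerate(factors):
        # Project the current mode onto its orthonormal patterns.
        core = np.moveaxis(np.tensordot(factor.T, core, axes=(1, mode)), 0, mode)
    return core, factors


rng = np.random.default_rng(1)
cube = rng.standard_normal((50, 8, 3))  # genes x time points x conditions (synthetic)

core, (genes_u, time_u, cond_u) = hosvd(cube)

# The cube is a superposition of rank-one subtensors weighted by core entries.
rebuilt = np.einsum("abc,ia,jb,kc->ijk", core, genes_u, time_u, cond_u)

# Significance of each rank-one subtensor and a normalized entropy
# (assumed definitions, by analogy with the matrix SVD).
fractions = core**2 / np.sum(core**2)
positive = fractions[fractions > 0]
entropy = -np.sum(positive * np.log(positive)) / np.log(fractions.size)

print("exact reconstruction:", np.allclose(rebuilt, cube))
print("largest subtensor fraction:", float(fractions.max()))
```

The patterns in each direction are orthonormal and the reconstruction is exact, but the core is full rather than diagonal, which is why the significance is attached to each combination of patterns rather than to an individual vector.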
And now we know which patterns we want to interpret. We can also define and deal with a degenerate subtensor space. Just quickly, I will tell you the way to think about it: what happens if you have the same pattern along the genes and the same pattern along time, paired, with similar coefficients in the core, with two different patterns along the conditions? In this case you can combine these into a single rank-one subtensor, and then you want to interpret the full combination rather than the individual ones, because you are interpreting unique patterns in the data, and when you have degeneracy, the computation does not give you unique patterns. The patterns you want to interpret are unique.

What kind of patterns do we get? You can see that we get patterns of variation across time. Now, because we maintain this additional dimension of conditions, we get a very easy to interpret pattern of variation across the conditions. And if we want to ask ourselves what remains the same in spite of the variation in the conditions, these are all the combinations of this constant pattern across the conditions with either one of these patterns along time and either one of the significant patterns along the genes, and we can interpret this. So we were able to explain all that we were expected to find in the data, what was known to be there, but we were also able to identify subtle variations between the two different oxidative stressors that previous analyses of the data were unable to identify. Some of them, like this gene and all the genes downstream of it, people expected to find in there; they were known biologically to act that way, but the analysis of the data
did not discover that. Some of these are actually new processes, for example the process of retrotransposition and the process of apoptosis. So we find many other new biological directions for investigation by looking at the data and maintaining that one extra dimension of conditions.

We also look at all of the patterns along the genes. Here I am not showing you the patterns along the genes themselves; I am showing you a probabilistic assessment of the enrichment of each one of these patterns in particular groups of genes. We find that every time we have a strong response to oxidative stress, which is essentially those red bars here, we also have increased expression from genes that sit near origins of replication. It is known that oxidative stress depletes this protein, Cdc6, which essentially means there is no binding to origins when there is a response to oxidative stress, or there is really reduced binding to origins. So this is equivalent to the correlation we found before, looking at different data and with different mathematics, the pseudoinverse projection. Essentially, before we found that when we have binding, we have reduced expression of nearby genes. Here we see that when there is no binding, when there is oxidative stress, we have increased expression of the very same genes. So again we could have predicted computationally essentially the same mechanism.

So we wanted to test it. Our biology collaborator, John Diffley, wanted to test how replication affects transcription, simply because, being an expert in replication initiation, he knows that replication happens, just as transcription does, on the same DNA template. This is a question that biologists have actually been interested in for a long time. We, even though we don't usually do experiments, were interested because we made this computational prediction, and we wanted to see if it was correct. So John sent us some strains, and we ran experiments in my very own computational lab: we actually printed the DNA microarrays and we ran them. The experimental work was done by Joel Meyerson, then an undergraduate, and Kayta Kobayashi from pharmacy, and they were great in running the experiments.

We ran cell cycle time courses under three different conditions. We had a control, where we had normal binding to the DNA, and then obviously we also had replication. We had the Cdc6 case, where there was no binding at the origins of the DNA, which John was able to create with some strains, and therefore there was no replication initiation and no additional DNA was replicated. And then there was another case, where we have the initial binding to the origins but no subsequent dissociation, which essentially again meant that there was no replication initiation; in the normal case we have initial binding and then dissociation to enable replication to progress along the chromosome. So we have three conditions: a control, a case where there is no binding at all to the origins, and a case where there is constant binding to the origins, and we look at how these three cases differ in terms of their RNA expression when we put them together. Again we use the higher-order SVD.

I want to make a special note of this. We had several replicates of the datasets, which is normally done; you have to make sure that you get good results. When you look at replicates of DNA microarray data, and actually when you look at replicates of any biomedical engineering signal, such as MRI, EEG or ECG, not just
DNA microarray data, come to think about it, any data that comes from a high-throughput technology, one wants to know about experimental artifacts, or, if you want, variation in baseline. For example, here we are looking at the patterns that we find along time across the replicates. We have time points which are, say, the even time points in the cell cycle time course, and we have the odd time points, and the difference between them is that they were done on two different days, or in two different batches of hybridization. We actually always see variation in baseline. It is this pattern, the second most significant pattern in the data here, this step function, which is just a different baseline for the even time points relative to the odd time points, or a different baseline depending on the day of hybridization.

We saw it already with the SVD, and at first biologists, molecular biologists who were not used to high-throughput technologies, were very concerned about it, because they did not think about the possibility of decomposing a single pattern into multiple patterns, and in that way separating this variation in baseline, this step function, from, in this case, the periodic behavior that is of interest. Again, this is shared by all high-throughput technologies, and a similar type of mathematics is already used to extract weak signals of interest from a measured signal, for example in ECG. There you also have variation in baseline: you have different technicians, different hospitals, different days. And if you want to do an electrocardiogram of a fetus, the fetus is inside the pregnant mom. What you are trying to measure is the heartbeat of the fetus, but on it is superimposed the heartbeat of the mom, which is much stronger, and on all this superimposed are also
lots of electronic signals from, actually, the electricity in the hospital. Very similar mathematics of decomposition is actually used in hospitals to detect the heartbeat of the fetus, which is the weakest one among all the different signals. Similarly, here we actually find that the artifacts are the strongest patterns, and these are the patterns that we filter out of the data once we interpret them. What was nice about the higher-order SVD is that it enabled us to find artifacts along these different directions, and then multiply all the matrices and the core tensor back into our original cube of data. So the same kind of filtering of noise we can do also on a tensor basis.

What did we find in the experiments? We find patterns that suggest unchanged behavior even in the absence of DNA replication in the yeast cells. We find this pattern that suggests reduced expression in the absence of replication relative to the control, where we do have replication. And we find this green pattern that suggests changes between the case where we have no binding to the origins and the case of constant binding to the origins. Each one of these relates to an interesting discovery; let me run directly to the one that relates to our prediction. So which are the genes that show increased expression in the absence of binding and decreased expression when there is binding? We tried every possible annotation of these groups of genes, and the only one that came up with a significant P-value, a pretty good one, of about ten to the minus six, is that of the genes that sit near origins, or near autonomously replicating sequences. So all these genes, essentially, show increased expression when there is no binding and decreased expression when there is binding, essentially following our prediction.
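The artifact filtering described above, where a step-function baseline between even and odd time points is identified as a strong pattern, removed, and the data are then rebuilt, can be illustrated on a plain matrix with the SVD. This is a hedged sketch only: the data are synthetic, the helper names `remove_pattern` and `step_strength` are hypothetical, and the talk applies the same idea on a tensor basis with the higher-order SVD rather than with this matrix version.

```python
import numpy as np


def remove_pattern(data, pattern):
    """Zero the SVD component whose eigengene best matches `pattern`, then rebuild."""
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    unit = pattern / np.linalg.norm(pattern)
    index = int(np.argmax(np.abs(vt @ unit)))  # eigengene most aligned with pattern
    s_filtered = s.copy()
    s_filtered[index] = 0.0
    return (u * s_filtered) @ vt, index


rng = np.random.default_rng(2)
n_genes, n_times = 200, 12
time = np.arange(n_times)

# Periodic signal of interest (synthetic): one oscillation per gene, random phase.
phase = rng.uniform(0.0, 2.0 * np.pi, size=n_genes)
signal = np.cos(2.0 * np.pi * time[None, :] / n_times - phase[:, None])

# Batch artifact (synthetic): even and odd time points get opposite baseline shifts.
step = np.where(time % 2 == 0, 1.0, -1.0)
baseline = rng.normal(4.0, 0.5, size=n_genes)
measured = signal + 0.5 * baseline[:, None] * step[None, :]
measured += 0.05 * rng.standard_normal(measured.shape)

filtered, removed_index = remove_pattern(measured, step)


def step_strength(matrix):
    """Norm of the projection of every gene's profile onto the step pattern."""
    return np.linalg.norm(matrix @ (step / np.linalg.norm(step)))


print("removed component:", removed_index)
print("step strength before:", step_strength(measured))
print("step strength after:", step_strength(filtered))
```

Here the artifact is deliberately made the strongest pattern, matching the observation in the talk that artifacts tend to dominate; the periodic signal survives the filtering because it is nearly orthogonal to the step function.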
This is also the first time that it was shown experimentally that origins of replication can actually affect transcription. Again, the biologists are very excited about it. The reason we did the experiments is that we computationally predicted this mechanism of regulation, and essentially we verified it now. So we can say that this demonstrates for the first time that modeling of DNA microarray data not only can be used to predict biological mechanisms; it can be used to correctly predict biological mechanisms. And this additional "correctly", this one word, means a lot.

I am just running here to tell you a little bit about some other ideas that we have and some future directions. Can you predict a physical mechanism? Well, it turns out that you can. We are looking at data where Evan Hurowitz measured the distribution of transcript lengths in yeast. He took all the RNA transcripts from yeast and ran them on one big gel for a long time, separating the transcripts by length. Then he chopped the gel into little pieces, extracted the RNA, and ran the RNA from each piece on a microarray, further separating the RNA by gene, or by DNA sequence. So we are looking at this matrix, and we are looking at the right eigenvectors, or the eigengenes, that we get from this matrix. They turn out to form this pattern, and when we look at each of them separately, it turns out that these vectors, which are here the colored graphs, fit very well Hermite functions, which are the eigenfunctions of the quantum harmonic oscillator and the generalized coherent state, and with the same eigenvalues as for the generalized coherent state.
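For reference, the ordinary, symmetric Hermite functions mentioned above can be generated and checked numerically. This is a sketch under stated assumptions: dimensionless units are used, the function name `hermite_function` is hypothetical, and the asymmetric generalization that the talk goes on to describe is not attempted here.

```python
import math

import numpy as np
from numpy.polynomial.hermite import hermval


def hermite_function(n, x):
    """n-th quantum harmonic oscillator eigenfunction in dimensionless units."""
    coefficients = np.zeros(n + 1)
    coefficients[n] = 1.0  # select the physicists' Hermite polynomial H_n
    norm = 1.0 / math.sqrt(2.0**n * math.factorial(n) * math.sqrt(math.pi))
    return norm * hermval(x, coefficients) * np.exp(-(x**2) / 2.0)


x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
basis = np.array([hermite_function(n, x) for n in range(4)])

# Numerical overlap integrals; an orthonormal set gives the identity matrix.
overlaps = basis @ basis.T * dx
print(np.round(overlaps, 6))
```

Orthonormality is what makes such functions plausible candidates for eigengenes, since the eigengenes of an SVD are orthonormal by construction.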
These are eigenvalues that form a geometric series. The only difference from the regular generalized coherent state is that here, and we can even fit this with a parabolic potential, we have this asymmetry. So you can see that the parabola here is much more relaxed than over here, where it's much steeper, so the constant for the parabola is actually much higher here. And what does it mean? It doesn't mean that this is a quantum system. What it means is that we have here the statistics of the generalized coherent state. So each of the profiles along the gel that were measured for each of the genes in the experiment looks like a Gaussian, which is asymmetric because we have this asymmetry in the parabolas, and the peaks of these Gaussians are selected from a wider Gaussian. And you can see here the fit of this wider Gaussian to the actual measured data. You can see the profiles of individual genes, and they are very narrow. They fit very well those asymmetric Gaussians, but you could also see, and easily imagine, that they would fit pretty much anything you would put through them. I mean, when you measure a peak with just a single point, anything will fit. The reason we believe this is so is the patterns we discovered, and the fact that we were lucky that they pointed to this generalized coherent state. By looking at these patterns we were able to get the statistics of all of the genes in the data, which is why we very strongly believe the peaks are those asymmetric Gaussians. So what does it mean if the peaks are asymmetric Gaussians? Put an RNA band on a gel for gel electrophoresis and just wait a little bit of time: just because of thermal broadening this band will slowly spread and start to get the shape of a Gaussian. Then you turn on the electric field.
What is happening is that this Gaussian is not a symmetric, regular Gaussian anymore. The peak of the moving band is moving toward the front of the band, a little bit like a Doppler effect, and away from the back, which is why we get this asymmetry. This is what we're predicting is happening here. There are studies, actually for many years, of the behavior of DNA and RNA in gel electrophoresis. They suggested there would be different broadening for moving bands, but they never suggested this particular asymmetry like we observe here. So we now predict this asymmetry and this mechanism of the movement. And in a way it's also a little funny that you could learn from DNA microarrays, which is the much newer technology, about gel electrophoresis, which is essentially a much more mature technology, and really the technology that enabled DNA microarrays in the first place. But then you might ask, well, what is the meaning of this wider Gaussian from which the peaks of the genes are selected? Well, here this is just a hypothesis: we're suggesting that this represents two competing evolutionary forces, if you want. Essentially they're trying to force genes into some equilibrium length, away from being too short or from being too long. It's asymmetric because there are two different forces, not one. I guess naturally there is no advantage to too-short genes or transcripts, just because they are not informative or specific enough, and if they're too long it would take forever for the system to make them.
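One simple way to picture an asymmetric Gaussian band of this kind is a Gaussian with a different width on each side of its peak. The sketch below is only an assumed parametrization for illustration; the function name `asymmetric_gaussian`, the parameters `sigma_back` and `sigma_front`, and the convention that the front of the band lies at larger positions are all hypothetical and not taken from the talk.

```python
import numpy as np

def asymmetric_gaussian(x, peak, sigma_back, sigma_front):
    """Gaussian profile with width sigma_back behind the peak (x < peak)
    and width sigma_front ahead of it (x >= peak); the value at the peak is 1."""
    x = np.asarray(x, dtype=float)
    sigma = np.where(x < peak, sigma_back, sigma_front)
    return np.exp(-(x - peak)**2 / (2 * sigma**2))

# A band whose peak sits toward the front: steep ahead of the peak, long tail behind it.
positions = np.linspace(0.0, 10.0, 101)
band = asymmetric_gaussian(positions, peak=5.0, sigma_back=2.0, sigma_front=1.0)
```

With `sigma_back` larger than `sigma_front`, the profile falls off more slowly behind the peak than in front of it, which is the qualitative shape described for the moving band.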
So we have these two competing forces, but we can say now that they act in the manner of the restoring force of the harmonic oscillator; we could actually find a k, a constant, for those forces. And we actually continue to test this hypothesis, first by generalizing the same results to the global set of transcripts measured for human, data also by Evan Hurowitz, and this is with Justin Drake, who is actually a biology undergrad. But then also by looking at subsets of genes of a particular function, under the assumption that if genes have a particular function, they might have particular, different, say, evolutionary forces acting on their transcript length. And if you're looking at just this parabolic potential for the global set, you might be able to see how it narrows and moves toward a shorter transcript size as you're looking at the subset of all the genes that are involved in translation, and even further when you look at the subset of all the ribosomal genes. So you see that the genes that are involved with the ribosome and with translation are much shorter than the overall global set, suggesting that because of their function they have this extra evolutionary force that works on them, which is different than the average one for the global set. One can think about cancer as a sort of evolution in action, within a progression. I could tell you more about this, but really what we see here, if we're looking at data from the Cancer Genome Atlas: again, this is the global set of transcripts. These are the genes that are overexpressed in the normals, from the patients that are in the Cancer Genome Atlas, and again you could see that the normal genes are a little shorter than average and the distribution is a little narrower. But then when you're looking at the tumors, the genes that are overexpressed in tumors are even much shorter.
We have this minimum moving much more to the left, and the distribution is much narrower. And that makes sense, because in tumors the cell division cycle is much faster than in normal cells, and it's already been known, only for a few genes, in Drosophila during development actually, but I suppose we might apply here the same thinking, that when the cell division cycle is faster, the cells simply do not produce longer transcripts. So this is where the evolutionary force comes from. Actually, in response to DNA damage in yeast we see the opposite effect. We see that the distribution of the overexpressed genes actually becomes much wider, so you want longer genes to deal with DNA damage. And again you might say, well, because, you remember, longer genes means more information in the genes, you want really specific activity of the genes that are going to go and tinker with the DNA in response to DNA damage. We are also developing some new decompositions. I mentioned to you the higher-order generalized SVD. Well, we want to compare more than two matrices, say three or more matrices, and again the transition from two to three is the major transition here. So what we're proposing is a decomposition where we have a similar case to the GSVD, only with more matrices. So now we're going to have for each of these matrices a decomposition into the same right-hand vectors, and into different sets of left-hand vectors and different diagonal matrices. How we calculate these patterns is what makes our decomposition special, and essentially we calculate it from diagonalizing, from the eigenvalue decomposition of, the center of mass of the quotients of the dot product of each of these matrices with itself.
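The calculation described here can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the speaker's implementation: the "center of mass of the quotients" is taken to be the arithmetic mean of all pairwise quotients A_i A_j^{-1} with A_i = D_i^T D_i, every matrix is assumed real with full column rank and the same number of columns, and the function name `ho_gsvd` is hypothetical.

```python
import numpy as np

def ho_gsvd(matrices):
    """Sketch of a higher-order GSVD: D_i = U_i @ diag(sigma_i) @ V.T with a shared V.

    Assumes each D_i is real, has the same number of columns, and has full column rank.
    """
    N = len(matrices)
    n = matrices[0].shape[1]
    A = [D.T @ D for D in matrices]            # dot product of each matrix with itself
    A_inv = [np.linalg.inv(a) for a in A]

    # Arithmetic mean ("center of mass") of all pairwise quotients A_i A_j^{-1}.
    S = np.zeros((n, n))
    for i in range(N):
        for j in range(i + 1, N):
            S += A[i] @ A_inv[j] + A[j] @ A_inv[i]
    S /= N * (N - 1)

    # Shared right-hand vectors: eigenvectors of S (real in exact arithmetic).
    eigenvalues, V = np.linalg.eig(S)
    if np.iscomplexobj(eigenvalues):           # guard against round-off only
        eigenvalues, V = eigenvalues.real, V.real

    left_vectors, sigmas = [], []
    for D in matrices:
        B = np.linalg.solve(V, D.T).T          # B = D @ inv(V).T, so D = B @ V.T
        sigma = np.linalg.norm(B, axis=0)      # column norms become the diagonal values
        left_vectors.append(B / sigma)         # unit-norm columns, not necessarily orthogonal
        sigmas.append(sigma)
    return left_vectors, sigmas, V, eigenvalues

# Three toy matrices with different row counts and a common column dimension.
rng = np.random.default_rng(0)
datasets = [rng.normal(size=(m, 4)) for m in (8, 10, 6)]
Us, sigmas, V, lam = ho_gsvd(datasets)
```

Because only the columns are shared, the three matrices can have different numbers of rows, which is what allows datasets from different organisms to be compared without matching their genes one to one.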
It's an exact decomposition, and we can show that in the case when there is an exact decomposition with orthogonal vectors in these matrices here and in the common matrix, then our decomposition will immediately calculate it. If you want to do something iterative, there are some iterative algorithms that we tested, and by seeding them with our solution they reach the best result much faster than otherwise. Because, as you remember, orthogonality and exactness of the decomposition are not always preserved when you go from, in this case, two matrices to three and more. Just to show you why this is interesting: now we can compare data from, say, three organisms, and really more than that, for the cell division cycle. We can simultaneously classify the time points and the genes in all these organisms, and we are able to get correctly the classification of genes that have similar sequence, genes that normally a biologist might try to group together because of similar sequence among the organisms, but that have different behavior. So, for example, this one gene, it's just an example, this one gene in pombe, which is one kind of yeast, would map by sequence to a whole bunch of genes in cerevisiae, which is another kind of yeast, but they don't have the same function. Some of these genes actually are anticorrelated
in their behavior to another group of these genes. And this algorithm, and also the generalized SVD for two matrices, are the only algorithms that were proposed for comparing data from different organisms that do not require mapping the genes of one organism to the other. All other comparisons map the genes, and a mapping by sequence similarity, or even by homology, does not necessarily mean functional similarity, so that would give you maybe the wrong map. And we're also looking a little bit at sequence data. Again, here, if you think about having a matrix with, say, nucleotides, or letters, in the matrix, then if you translate each nucleotide into a vector, that matrix turns into a cuboid. So now you can think about using, well, a variation on the higher-order SVD to find patterns in the data. Patterns across the organisms seem to match the known taxonomic groups, and then we could go and look at the patterns across the positions in the sequence and see what are these correlations along the sequence, here of ribosomal RNA, that actually identify each one of those taxonomic groups. And when we do that, we're able to find some known and some new insertions and deletions of structures, and we're also able to find that specific adenosines, specific nucleotides which are unpaired and are thought to be significant for the function and the tertiary structure, are very significant in separating organisms into groups, which is something that wasn't shown before. We're also able to find both similarities and differences among the same groups, which are of interest.
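The step of translating each nucleotide into a vector, so that a matrix of letters becomes a cuboid, can be sketched as below. This is a toy illustration only: the alphabet (including a gap symbol), the one-hot encoding, the made-up sequences, and the helper name `encode_alignment` are assumptions for the example and are not taken from the talk.

```python
import numpy as np

ALPHABET = "ACGU-"  # assumed alphabet: four RNA nucleotides plus an alignment gap

def encode_alignment(sequences):
    """Turn aligned sequences (organisms x positions) into a binary cuboid
    of shape organisms x positions x letters, one indicator vector per letter."""
    letter_index = {letter: k for k, letter in enumerate(ALPHABET)}
    cuboid = np.zeros((len(sequences), len(sequences[0]), len(ALPHABET)))
    for i, sequence in enumerate(sequences):
        for j, letter in enumerate(sequence):
            cuboid[i, j, letter_index[letter]] = 1.0
    return cuboid

# Made-up aligned sequences standing in for an rRNA alignment.
alignment = ["ACGU-", "ACGUA", "UCGU-"]
cuboid = encode_alignment(alignment)

# Unfold along the organism axis and take an SVD: the left vectors are patterns
# across organisms, the right vectors are patterns across (position, letter) pairs.
unfolded = cuboid.reshape(cuboid.shape[0], -1)
U, s, Vt = np.linalg.svd(unfolded, full_matrices=False)
```

Unfolding along one axis and applying an ordinary SVD is one simple variation; other choices of unfolding or a full tensor decomposition would follow the same encoding.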
I told you we're finding correlations, and we could use correlations to actually predict mechanisms, but we have all this mathematics which we're using in order to find the correlations. Wouldn't it be possible to just play with some visual tool instead of this mathematics? This is a project that we just started with Chris Johnson at the SCI Institute in Utah, where the idea is: let's visualize the data and look at it, and let our brain find those correlations that hopefully have some causal relations underlying them. And why do we do all that? This is because one day we want to be able to control this. What is this? That's the cell division cycle. And we want to be able to control it in real time and in vivo, just like, say, NASA controls the shuttle in outer space. When I tell this to physicist friends they say, well, you know, you want to control things, show me the differential equations. And what I want to emphasize again here, and I suppose there is no better forum to do that than this department here, is that before differential equations there are data, and then there are patterns that are extracted from the data. There are many instances of that in the sciences, and I think high-throughput molecular biology is one of those instances. So it's possible to predict mechanisms by looking at your data and finding correlations in the data, even before you have the understanding that allows you to write a differential equation. So I thank our collaborators and my students and our support, and thanks to you too.
Right. Yes, yes, definitely so. I mean, we're looking here at the transcript length as measured in the microarrays, which includes the UTR and everything, but you could just as well look at the length of genes, I guess the sequence lengths. And yes, ribosomal genes are known to be shorter, which is really why we looked at them here to begin with. So in this case,
yes, we definitely think there is a functional reason why those transcripts also happen to be shorter. But I guess if we look at those genes, the ones that we find in the data from the Cancer Genome Atlas, and we try to characterize those genes that are overexpressed in the tumor and that are shorter, to try to characterize them with any, say, GO annotation or any functional annotation, and put them in any particular subgroup: we do see a lot of ribosomal genes overactive in the tumors, but we also see them overactive in the normals. So if you're looking at just this group that is only active in tumors but not in normals, we cannot put a P-value on an overrepresentation of the ribosomal genes in there; it's not significant. So we cannot say this is due to a particular function. All we can say is, wow, they're shorter. This is one thing we could say.
Yeah, you know, for the singular values. Oh, and how do we interpret the eigenmodes? OK, yes, I guess in physics the eigenvalues are the energies, right? I don't know. It's a good question. The only interpretation I could give is that the eigenvalues, or the singular values, would be the significance: if, say, an eigengene represents some process, then the significance of this process in the data, to what extent this particular process is manifested in the data. I mean, that's not a good interpretation in the physical sense, as you have. Right.
Yes, yes, well, yes, we do. Yeah, I mean, we use it in the same way; I guess I didn't go into that. So now I understand your question. Yes, this is when we look at the SVD. I'm sorry, I guess I only have here the higher-order SVD. So imagine for a second that this is the SVD. We're trying to interpret all of these values as the coefficients in the superposition of those subtensors, and similarly when we do the SVD
of, let's say, just one matrix, and here we have the left vectors, and here we have the square diagonal matrix, and here we have the right vectors. Yes, again, we use this as suggesting which are the significant vectors for interpreting the data. So then the question is how many of the significant ones one wants to keep. We actually find that many of the very most significant ones correspond to experimental artifacts. So, I mean, if you just go and classify genes according to the significant ones, you would classify them essentially according to experimental artifacts. We actually see people do that. So one really needs to actually interpret the patterns before using them, before, you know, classifying genes further. But then there are always some significant patterns that represent the biology, and then where do you put the cutoff? I mean, there are different ways to figure out where you put the cutoff. Mostly we just go and interpret from the top, as low as we can go. We try to interpret them: what we can interpret, that's great; what we cannot, that's where we put the cutoff. But, for example, someone who is now in Oxford was interested in this question, and he came up with this probabilistic model to try to figure out where to put the cutoff, which ones of these significance levels represent probabilistic significance as well. So he used his mathematics on the same data. He actually wanted to show that we interpreted patterns that didn't have probabilistic significance, but, maybe we were lucky, it turned out that wherever we put the cutoff, this is where his methodology determined that the singular values do not suggest probabilistic significance anymore. So still, anything we interpreted works, I guess. Yes. Right.
Exactly. So, two things, you see. You want to remove the constant, if you have a constant, and we want to remove it in, say, this case. So we want to remove this constant pattern, but then you also want to remove this baseline variation, right? So, yes. And actually these two are the most significant patterns, so that this singular value, I guess if you normalize it, is about fifty percent of the data, and this one is about fifteen percent of the data, and that's an experimental artifact. So, yeah, we want to remove it. You're right, we want to remove it out of the data. Exactly.
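The normalized significance mentioned in this answer, and the removal of patterns that were interpreted as artifacts, can be sketched for a single matrix as follows. This is a minimal sketch on made-up data, assuming that the fraction of the data captured by a pattern is its squared singular value divided by the sum of all squared singular values; the function names `svd_with_fractions` and `remove_patterns` are hypothetical.

```python
import numpy as np

def svd_with_fractions(X):
    """SVD of X plus the fraction of the overall signal captured by each pattern."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    fractions = s**2 / np.sum(s**2)   # assumed normalization; the fractions sum to 1
    return U, s, Vt, fractions

def remove_patterns(X, drop):
    """Rebuild X after zeroing the singular values of the patterns listed in `drop`."""
    U, s, Vt, _ = svd_with_fractions(X)
    kept = s.copy()
    kept[np.asarray(list(drop), dtype=int)] = 0.0
    return (U * kept) @ Vt

# Made-up genes x arrays matrix dominated by a constant offset plus small noise.
rng = np.random.default_rng(0)
X = 5.0 + 0.1 * rng.normal(size=(50, 8))

U, s, Vt, fractions = svd_with_fractions(X)
X_filtered = remove_patterns(X, drop=[0])  # drop the leading, constant-like pattern
print("fraction captured by each pattern:", np.round(fractions, 4))
```

The code does not decide which patterns are artifacts; as in the talk, that decision comes from interpreting each pattern before filtering.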