Today I'm pleased to say that we have a distinguished guest from Brazil. She is a professor [inaudible]. She is active in the national biostatistics society [inaudible]; she is on the board of directors and secretary of [inaudible]. Last year she was on the organising committee [inaudible]. She got her PhD at the University of North Carolina [inaudible], and she was in France last year. Her research involves [inaudible] genetics, and today she will be talking about homogeneity tests [inaudible].

Thank you very much for the invitation. It's a pleasure to be here. I'll try to make a small introduction to the biology, even though I think I don't need to here; usually the audience is more just statisticians. So I'll go over it first, just so you can have an idea of the type of data set I'm working on. I will talk briefly about the Hamming distance and its decomposition. Then I will talk about the homogeneity test among groups of DNA samples, when you have a data set where for each individual you have a DNA sequence. Then I will talk briefly about the distance measures for microsatellite data, which is another kind of data; I will briefly explain the two different kinds of data that we can work with. Then the homogeneity test with this kind of data, microsatellite sequences, and I will explain briefly some applications.

So, if you have a DNA sequence, you know that you have a double helix.
In the sequence, T always pairs with A and C pairs with G, and if you are a diploid individual you get a pair of sequences. So you have A, C, T or G. The type of data that I will get from each individual would be a sequence of letters, like this sequence of letters.

The difference with microsatellite data is that it comes from a part of the genome that does not code for a protein. You have introns and exons, of course, and in the part that is non-coding you sometimes get a bunch of letters that does not code for anything. But it seems to have a lot of variation, and people are starting to wonder why it has so much variation and whether it is associated with some disease or not. In microsatellite data, or variable number of tandem repeats, what you have is repeated DNA composed of two to six base pairs. Like this one: here I have eleven ACs, AC, AC, AC, so two base pairs that are repeating. In this one you have six repetitions: a motif like CGCC [inaudible], and this is repeated six times. So in microsatellite data, what I would like to see at a certain locus is how many repetitions I have at that locus for each individual.

So, first thinking about the DNA sequences, only the letters; then I will briefly talk about the microsatellite data. You have a vector X_i, with i indexing the individual. So for each individual I have a sequence of letters, which can be A, C, T or G; in this case we have four categories. For the Hamming distance, you look at a pair of sequences, for individual i and individual j, and you would like to see what the difference between these two individuals is. So, looking at the pair of sequences, the Hamming distance would be the proportion of sites in the sequence where X_i differs from X_j.
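The Hamming distance just described, the proportion of sites at which two aligned sequences differ, can be sketched in a few lines of Python. The function name `hamming_distance` and the two example sequences are made up for illustration and are not from the talk.

```python
def hamming_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    # Count the sites where the two letters (A, C, T or G) disagree.
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


# Hypothetical sequences for individuals i and j: they differ at 2 of 8 sites.
x_i = "ACTGACTG"
x_j = "ACTGTCTA"
print(hamming_distance(x_i, x_j))
```

Because only equality of categories is checked, the same function would work for any categorical sequence, not just the four DNA letters.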
So the measure we are talking about here is the distance within the group: within a group we make all the pairwise comparisons. If we have a group with n_g individuals, for group g we make all the comparisons between the individuals. And this would be the distance measure within groups... I'm sorry, between groups. For between groups, we compare an individual from group g and an individual from group g'. We make all the pairwise comparisons between an individual from one group and an individual from the other group, see what proportion of differences we get, and divide by all the comparisons we can make, which is n_g times n_g'.

[inaudible] If you take the pooled sample and make all the comparisons between individuals, disregarding the groups, you have what I call D-bar-zero. This is if we make the comparisons disregarding the group, between all the individuals we have, so we have n individuals in total for the pooled sample. And this we can decompose into a measure within and a measure between.

So the homogeneity test is about the population Hamming distances; one set of symbols is for the population, and the D's are for the sample. What I am trying to test is whether the distance that I have between groups can be expressed as the average of the within-group distances. So the null hypothesis would be that, if there is no difference between groups, this between-groups measure is just the average of the within-group distances.
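As a rough illustration of the within-group, between-group and pooled averages described above, here is a minimal Python sketch. All function names and the toy groups are assumptions made for this example; in particular, `homogeneity_contrast` is only one plain way of writing "between minus the average of the withins" and is not claimed to be the exact statistic shown on the slides.

```python
from itertools import combinations, product


def hamming(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def within_distance(group):
    """Average Hamming distance over all pairs inside one group."""
    pairs = list(combinations(group, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def between_distance(group_g, group_h):
    """Average over all n_g * n_g' cross-group comparisons."""
    pairs = list(product(group_g, group_h))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def pooled_distance(groups):
    """Average over all pairs, disregarding group labels."""
    pooled = [seq for group in groups for seq in group]
    return within_distance(pooled)


def homogeneity_contrast(group_g, group_h):
    """Between-group distance minus the average within-group distance."""
    average_within = 0.5 * (within_distance(group_g) + within_distance(group_h))
    return between_distance(group_g, group_h) - average_within


# Hypothetical toy groups of aligned sequences.
group_1 = ["AAAA", "AAAT"]
group_2 = ["TTTT", "TTTA"]
print(within_distance(group_1), between_distance(group_1, group_2))
print(homogeneity_contrast(group_1, group_2))
```

A contrast near zero is what the null hypothesis of homogeneity would suggest; a clearly positive value points toward the one-sided alternative.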
The alternative hypothesis would be that this is greater than this. Because of some mathematics, we see that it can only be greater than or equal, so that is why the alternative hypothesis is one-sided. Doing some math, we can write down the between measure like this: twice the distance between groups minus each within-group distance. Then, using some results on U-statistics, we can get the asymptotic distribution of this statistic, and we can actually do some applications with it. You will see later on that in my applications we do not even have to bother about this. We do not need to bother with the distribution, which is a linear combination of chi-squares; we can just use resampling techniques like the bootstrap or the jackknife to do the test.

The other situation is if, instead of having the sequence of letters for each individual, you only count the number of repetitions you have at each site for each individual. So here Y_git is counting the number of repeat units of the i-th copy in the g-th group at a specific locus at time t. [inaudible] Here I am just looking at the difference between two copies, copy i and copy i', from the same group, and here I look at different groups. It is the same idea as the Hamming distance we were using, so we now have this distance. The difference is that, looking at microsatellite data for diploid individuals, we are always going to have two copies for each individual. That is why I have a sample of 2 n_g in each group: if I have n_g individuals in group g, I have two copies for each individual, so I have 2 n_g copies. Here I am just looking at the squared difference as a measure of distance.
So the distance measures for microsatellite data are similar to the Hamming distance ones, but the difference is that with the Hamming distance we are looking at categories and seeing whether they are different or not. Here, since we are measuring the number of repetitions, we have a quantitative response, and we can just look at the difference and square it. But the same thing goes on here with the D's for the pooled sample: we can decompose the pooled-sample distance into the within and between distances. We decompose the within and between measures, and we are trying to find the asymptotic distribution of these statistics. The goal is to get a statistic with an asymptotic distribution that we can work with to actually do the test.

So for the homogeneity test among microsatellite samples, we can work with this parameter here, which is actually the variance in the expected number of mutations due to population divergence. I don't know if you have worked with coalescent theory or phylogenetic trees, to see how evolution goes in a species or a genus. Behind all of this, I also take that theory into account. That is why I had that t that accounts for time, because as time goes on the genes can evolve, or the species, whatever we are working with.

So the test is about the variance in the expected number of mutations due to population divergence at a specific locus, say l, compared with the variance in the number of mutations within population. Under the null hypothesis, I am saying that this is zero. The test statistic would be this, very similar to the one for the Hamming distance; just the way that you estimate the parameter is a little different. I am going to skip those formulas because this is just the asymptotic theory.
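To make the squared-difference distance for repeat counts concrete, here is a minimal Python sketch. It assumes each diploid individual is stored as a pair of repeat counts at one locus, so a group of n_g individuals contributes 2 n_g copies; the function names and the numbers are hypothetical, and comparing all copies within a group (including the two copies of one individual) is an assumption of this sketch.

```python
from itertools import combinations, product


def copies(individuals):
    """Flatten diploid individuals (two repeat counts each) into a list of copies."""
    return [count for individual in individuals for count in individual]


def within_squared_difference(group_copies):
    """Average squared difference over all pairs of copies in one group."""
    pairs = list(combinations(group_copies, 2))
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


def between_squared_difference(copies_g, copies_h):
    """Average squared difference over all cross-group pairs of copies."""
    pairs = list(product(copies_g, copies_h))
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


# Hypothetical repeat counts at one locus: two individuals per group.
group_g = copies([(11, 12), (11, 13)])   # 2 * n_g = 4 copies
group_h = copies([(15, 16), (14, 16)])
print(within_squared_difference(group_g))
print(between_squared_difference(group_g, group_h))
```

Unlike the Hamming distance, which only records whether two categories differ, this measure grows with how far apart the repeat counts are.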
So let's go to the applications, which is the more interesting part. For the first application, let's think about DNA sequences. I have case one, with 87 sequences, of which 58 are Homo sapiens and 29 are other primates. The goal here is to see if our test works: we would like to know whether, doing our test, we can actually find a difference between these two groups. The null hypothesis would be that there is no difference between the groups. If we reject that hypothesis, it means that our test seems to be working, actually detecting the difference. To check that, we also have another case where we took these 58 sequences and just regrouped them into two groups of 29 sequences each. In this case we did 2,000 bootstraps; we resampled 2,000 times. Case three is the same as case two, just the bootstrap is with 10,000 [inaudible].

So with these 58 sequences, before I started the test, I divided them. Under the null hypothesis, for the bootstrap distribution, I do not know which group they are in; they are in the pooled sample. But then I do the test knowing that one set is one group and the other set is the other group: I separated them first, then I do the test.

So these are the observed values of the test statistic for each case. The numbers do not mean that much, but if you now look at the p-values, you will see that for case one, when you do 2,000 bootstraps, you reject the hypothesis of homogeneity of the groups. That is what we wanted to detect, that the groups were actually different: it was Homo sapiens and the other primates. And if we do 10,000 bootstraps, it is the same thing: we also reject.
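The resampling idea described here (compute the statistic on the observed grouping, then resample from the pooled sample, where group membership is unknown under the null hypothesis) might be sketched as follows. This is only an illustrative version under stated assumptions: resampling with replacement from the pooled sequences, the simple "between minus average within" contrast, a one-sided p-value taken as the proportion of bootstrap values at least as large as the observed one, and made-up names and toy data. It is not claimed to reproduce the exact procedure used in the study.

```python
import random
from itertools import combinations, product


def hamming(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean(values):
    return sum(values) / len(values)


def contrast(group_g, group_h):
    """Between-group distance minus the average within-group distance."""
    within_g = [hamming(a, b) for a, b in combinations(group_g, 2)]
    within_h = [hamming(a, b) for a, b in combinations(group_h, 2)]
    between = [hamming(a, b) for a, b in product(group_g, group_h)]
    return mean(between) - 0.5 * (mean(within_g) + mean(within_h))


def pooled_bootstrap_p_value(group_g, group_h, n_boot=2000, seed=0):
    """One-sided bootstrap p-value, resampling both groups from the pooled sample."""
    rng = random.Random(seed)
    observed = contrast(group_g, group_h)
    pooled = list(group_g) + list(group_h)
    exceed = 0
    for _ in range(n_boot):
        # Under the null hypothesis the labels carry no information,
        # so both bootstrap groups are drawn from the same pooled sample.
        boot_g = rng.choices(pooled, k=len(group_g))
        boot_h = rng.choices(pooled, k=len(group_h))
        if contrast(boot_g, boot_h) >= observed:
            exceed += 1
    return exceed / n_boot


# Hypothetical toy data: two groups that are clearly different.
group_a = ["AAAAAAAA"] * 10
group_b = ["TTTTTTTT"] * 10
print(pooled_bootstrap_p_value(group_a, group_b, n_boot=500))
```

Running it with a larger `n_boot` changes only the resolution of the p-value, which fits the remark in the talk that 2,000 and 10,000 bootstraps led to the same conclusion.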
For cases two and three, we are not supposed to reject the hypothesis, and the p-value is big, which means that we do not reject the hypothesis of homogeneity of the groups, because they are actually the same group, just divided to see if the test was working. This is just the difference, for case one, in the empirical distribution for 2,000 and 10,000 bootstraps; there is not much difference, so it is just to show that you do not need that many bootstraps in order for your test to work.

The second application is data from the Genetic Analysis Workshop; I don't know if you have heard about it. It is a workshop in the United States that happens every year, I think every two years. For each workshop they send you a set of data, and you can work with that data and then go to the workshop. Each person, each group, analyses the data the way they want, according to the goal they think would be nice to work on with the data. Later on this data is public and you can use it for any purpose you want. So I got the data from GAW, workshop number 14, and saw that the data set would work for the methodology we were working on.

This data is from a study about alcoholism, and they had microsatellite data according to ethnicity and alcoholism. So they had whites, blacks and others; in the others they had Hispanics and [inaudible], and the others were small groups, so that is why we grouped them here. They were also separated by whether they were affected, that is, whether they were considered alcoholic. These are the unaffected: they are non-alcoholic with no symptoms. These are the pure unaffected, because they never drank. This group here had drunk, but they were not affected; they had no symptoms. We had a total of 219 microsatellite loci.
For this data set, the interest is to verify whether the groups defined by ethnicity and level of alcoholism are homogeneous at each locus. The test here is done for each locus, so you are testing, at the specific locus, whether there is variability or not among the groups. There is a way of doing it for the whole data, I mean just adding up and taking the average over all loci. But since microsatellite data has a lot of variability at each locus, you may not see the difference at each locus specifically if you add up and take the average. So that is why our test is done for each locus. Because of that, we had to correct the significance level here, because we are doing several tests, one for each locus. It is like doing multiple comparisons, as in the analysis of variance when you do multiple comparisons. So that is why we corrected the significance level here by Bonferroni.

Notice that here we want to know the difference among groups, and we have three groups of ethnicity and three groups of alcoholism, so we do not have two groups any more like we had in the first application. We noticed that the jackknife works better when you have more than two groups. We did some studies and found that when we have more than two groups, the better resampling technique to use is the jackknife. That is why we use the jackknife here.

So here is a 95% confidence interval for our test statistic, looking at one specific locus; it is this locus over here. According to this interval, zero is way off; zero is not in the interval, so we detected some differences. I am showing here just the ones that are more interesting, because we did a lot of tests with many different loci.
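A jackknife confidence interval of the kind mentioned above can be sketched with the textbook delete-one jackknife. The sketch below is an assumption-laden illustration, not the procedure from the paper: it leaves out one copy at a time across two groups, uses a normal approximation for the interval, and applies a simple squared-difference contrast on made-up repeat counts. The names `contrast` and `jackknife_interval` are hypothetical.

```python
from itertools import combinations, product
from statistics import NormalDist, mean


def contrast(groups):
    """Between-group minus average within-group squared difference (two groups)."""
    group_g, group_h = groups
    within = [
        mean((a - b) ** 2 for a, b in combinations(group, 2))
        for group in (group_g, group_h)
    ]
    between = mean((a - b) ** 2 for a, b in product(group_g, group_h))
    return between - mean(within)


def jackknife_interval(groups, statistic, level=0.95):
    """Delete-one jackknife interval with a normal approximation."""
    full = statistic(groups)
    leave_one_out = []
    for g_index, group in enumerate(groups):
        for i in range(len(group)):
            reduced = [list(members) for members in groups]
            del reduced[g_index][i]          # drop one observation
            leave_one_out.append(statistic(reduced))
    n = len(leave_one_out)
    center = mean(leave_one_out)
    variance = (n - 1) / n * sum((t - center) ** 2 for t in leave_one_out)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * variance ** 0.5
    return full - half_width, full + half_width


# Hypothetical repeat counts at one locus for two groups.
groups = [[11, 12, 11, 13], [15, 16, 14, 16]]
low, high = jackknife_interval(groups, contrast)
print(low, high)
```

If zero falls outside the interval, that is read as evidence of a difference at that locus; when many loci or pairs are tested, `level` would have to be raised to reflect the corrected significance level.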
So here you can find that there is a difference between the groups: this is the affected versus the unaffected with no symptoms, this is the never drank versus the affected, and this is the never drank versus the unaffected with no symptoms. So at this locus, it seems that there is a difference between those groups. After you see whether there is a difference among the three groups, you then have to make the multiple comparison test between one and the other, and that is where we corrected the significance level.

This is another locus. We can see this big difference here. This one is below zero, but there is a difference here as well, though maybe not as much as here or here. It is something that someone who knows more about the data has to discuss, to see what is happening at this locus.

This one is for differences in ethnicity. You see that there is a big difference between blacks and others, also between blacks and whites, and between whites and others. Remember that it is not that there is a difference everywhere; I am showing just the loci where I found some difference. Here the variability is much greater; you can see that the confidence intervals here are wider. So at that specific locus the variation is high, but you can still see some difference between blacks and whites. But not here: for blacks and others you cannot see the difference any more, and neither here; only between blacks and whites. And here you can see a difference between blacks and others, but not really between blacks and whites, and you can see one between whites and others. Again, this is another locus: for blacks and whites there is no difference. We have many different sets of figures like these, but I am showing just the ones where we could detect more difference.
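The correction of the significance level for the pairwise comparisons can be illustrated with a small Bonferroni sketch: the overall level is divided by the number of tests, and each p-value is compared with that stricter threshold. The p-values and comparison labels below are invented purely for illustration; they are not results from the study.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level so that the overall level is at most alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def bonferroni_decisions(p_values: dict, alpha: float = 0.05) -> dict:
    """Flag each comparison as significant or not after Bonferroni correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {name: p <= threshold for name, p in p_values.items()}


# Hypothetical p-values for three pairwise comparisons at one locus.
p_values = {
    "affected vs unaffected": 0.001,
    "never drank vs affected": 0.03,
    "never drank vs unaffected": 0.2,
}
print(bonferroni_threshold(0.05, len(p_values)))
print(bonferroni_decisions(p_values))
```

In this toy example a p-value of 0.03 would count as significant at the uncorrected 5% level but not after the correction, which is the sense in which significance becomes harder to reach.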
According to the locus, you can see that some loci are much more variable than others. So that is why we were doing separate tests for each locus.

So here are some of the references. This one is about coalescent theory. This is about U-statistics theory. This is the paper that uses the data set in the application with Homo sapiens and primates, and that is when we developed the test for DNA sequences. These are for the asymptotic results of the test statistic using U-statistics, and this is the last one, the one with microsatellites. It was accepted in 2010, so it is already online, but it has not appeared in print yet; you can get it online already. And I think that is it. Thank you very much.

Thank you. Let me ask one question. You said [inaudible] saying something about how it should be [inaudible].

Yes. Because the overall test has a five percent level of significance, if you do the Bonferroni test you have to divide by the number of tests you are making. If we have three groups and three groups, and you are going to see whether groups one and two are different, one and three, two and three, you are making six comparisons. So you have to divide the level of significance, five percent, by six. So to be significant is harder. It is much harder to get significance when you divide the level of significance, because instead of needing a p-value of 0.05, you need a p-value of 0.05 divided by six to be significant. I don't know if you follow.

[inaudible] What is the difference when you combine? Yes, you can.
There are tests; in the paper I have a discussion about whether you want to combine the tests, because that was one of the questions of the referee: if we want to combine all the loci, what do we do, and so on. So there is a discussion about it. Thank you.