Hi everyone. Today I'm going to present the study that we submitted to NDSS, about privacy leakage in personalized mobile in-app advertising. Before we dig into the topic, we'll go through the bigger picture of mobile advertising. There are basically a few parties in the whole picture. In order to make profits, publishers can publish a paid app, and whenever users buy their app they get paid. But most users, I believe, tend to choose free apps rather than paid applications, simply because they don't want to spend an actual penny on the applications they use after purchasing their mobile device. So how do the developers of free apps survive? That's where the ad network comes in: the network collects information from the users through their mobile devices and delivers personalized ads to the application, and when users see an ad in their application and click through it, the network and the publisher share the profits. But my question is: are these apps really free, or is there a hidden cost to the users? As we said, advertising is an essential part of the mobile application ecosystem, which apparently creates a win-win situation for publishers, developers, and users: users don't have to pay money for using the app, and developers still get paid when users use their application. But in order to build a more accurate personalized advertising model, the ad networks collect a lot of information for the user profiles they maintain, and since this affects more and more users today, we were wondering whether there are any privacy concerns. Lots of studies have been done on advertising in the past few years.
In web advertising, ads are delivered through an iframe, which is isolated from the web publisher's page by the same-origin policy, meaning that only the ad network can know the content that was personalized with the user's information. But it is totally different for the mobile case. First of all, mobile devices are different from personal laptops simply because they are more intimate to users: users tend to carry their mobile devices around with them. At the same time, the ad network, through the mobile device, can record information such as geographic location, movement, and so on, and that information is transferred to the ad network to personalize the ads. What's most important is that the ad code runs in the same process as the app itself and shares the same privileges, meaning that developers can access the personalized ad code and possibly reverse-engineer it in order to learn user information. A few prior research efforts have studied mobile advertising by feeding crafted user profiles to the mobile advertising system; in other words, they crafted fake user profiles in order to see what kind of content such users would be shown. For example, in one study they downloaded more gaming apps in order to pretend to be teenagers, but in fact we don't really know what the actual signal is that triggers the network to believe you are a teenage user. That study showed that only user interests are used for personalization, but what about demographic information such as age, gender, or even income level? In some sense we believe that demographics are more sensitive than user interests. As you can imagine, I wouldn't mind people knowing that I love sports, but I would be more concerned about people knowing how much money I make.
So our study has two goals. The first goal is to study to what degree in-app ads are personalized to different attributes of users; in the current study we focus on the interest profiles and also the demographic profiles of users. The second goal is to understand how much an adversary, such as an app developer, can learn about a user by observing the content of the ads. We limit our work to the following scope. We focus on the Android platform, which holds the majority of the market share in mobile devices, and we focus on AdMob, the Google mobile ad network, which also holds a majority share on Android devices. We collect the following information from the users. First, we collect their demographic information; although Google indicates that it only personalizes ads based on three demographic targeting options, we measure more demographics than those three, simply because we want to test whether there is any correlation between the ad content and the other demographics. We also collect user interests, which are basically the twenty-two labels provided to developers through the ad targeting API, such as sports, games, luxury, news, and so on. For the methodology, as we said before, this is different from the previous approach because we recruit real users. We collect the demographics and interest profiles from real users, and we also collect the ad content through an application on their mobile devices, in order to see what ad content they are actually seeing. In other words, we build the ground truth instead of using fake user profiles.
For the recruitment, we recruited people through Amazon Mechanical Turk and asked them to install and run an application on their mobile device, which I will talk about in the next slide, and we also asked them to complete a survey for gathering their demographic and interest profiles, so that we get the interest labels as well as the demographic labels for each user included in the current study. The application we asked them to install is basically a blank application that does nothing but initiate one hundred ad requests to the ad network, without setting any targeting attributes in the ad control. It is also a VPN tunnel that connects to our server and transfers the traffic back to us. We specifically asked the users to turn it on for five to fifteen minutes, and we asked them to avoid using other applications through the VPN during the data collection. This is because for other applications on the mobile device, developers tend to configure the ad control in order to receive more targeted ads; for example, a sports news application might set the sports category in the ad control in the first place, so that it gets more targeted sports-related ads, simply because they know that users who love sports news tend to love other sports products. In order to collect the information, we need to understand the format of mobile ads. Currently, a mobile ad is nothing but HTML source code, and the representation of a mobile ad can be defined by its landing URL, which is the destination URL the user agent will be redirected to after clicking on the ad. The landing URL is part of the HTML source, with the key attributes you can see in the source code on the slide.
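As a rough sketch of that extraction step (the exact markup of AdMob's ad HTML is an assumption here; real ads may embed the destination in JavaScript or redirect chains), pulling a landing URL out of an ad's HTML source could look like:

```python
from html.parser import HTMLParser

class LandingURLParser(HTMLParser):
    """Collects href targets of anchor tags in an ad's HTML source.

    Assumes the click-through destination appears as an <a href=...>;
    real ad markup may encode it differently.
    """
    def __init__(self):
        super().__init__()
        self.landing_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.landing_urls.append(value)

def extract_landing_urls(ad_html: str) -> list:
    parser = LandingURLParser()
    parser.feed(ad_html)
    return parser.landing_urls

# Hypothetical ad snippet for illustration
ad = '<html><body><a href="https://example.com/product?id=1">Buy now</a></body></html>'
print(extract_landing_urls(ad))  # ['https://example.com/product?id=1']
```

Once a landing URL is isolated, it serves as the key for the interest-labeling step described next.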
In order to categorize each ad into one of the twenty-two interest categories that we mentioned before, provided by Google as targeting options, we queried those landing URLs against a publicly available database API to get the interest labels for the advertisements. If we did not get any result back from the database, we had three researchers manually label those ads, so that in the end we had an interest label for each advertisement. Here is just an overview of the data set: we recorded two hundred seventeen user profiles and collected around one hundred ads per subject; about seven hundred unique ad URLs were collected, and seventy-two percent of them are from the Google Play store, simply because we ran the experiment on the mobile platform using the Google ad network. For the interest-based personalization measurement, we conducted two measurements. First, a precision measurement, which is the intersection between the user's interest labels and the ad's interest labels, over the set of the ad's interest labels; in other words, it measures how precisely the ad network knows the user's real interests. Second, a recall measurement, which is the same intersection over the set of the user's interest labels; in other words, it measures the network's coverage of the real user interests. Based on the results, we found that seventy-nine percent of users had at least twenty-one percent of their interest categories guessed correctly, and eleven percent of users had at least eighty-three percent of their interest categories correct. We also found that Google could cover at least half of the real user interests for sixty percent of the users. In general, this means that mobile ads are highly personalized based on user interests.
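The two set-based measurements just described can be sketched directly (a minimal illustration with hypothetical interest labels, not the study's exact code):

```python
def precision_recall(user_interests, ad_interests):
    """Set-based precision and recall between a user's real interest
    labels and the interest labels of the ads they received.

    precision = |user ∩ ads| / |ads|  (how many ad categories match real interests)
    recall    = |user ∩ ads| / |user| (how much of the user's interests the ads cover)
    """
    user, ads = set(user_interests), set(ad_interests)
    overlap = user & ads
    precision = len(overlap) / len(ads) if ads else 0.0
    recall = len(overlap) / len(user) if user else 0.0
    return precision, recall

# Hypothetical labels: one of the two ad categories matches a real interest
p, r = precision_recall({"sports", "news", "games"}, {"sports", "luxury"})
print(p, r)  # precision 0.5, recall ≈ 0.33
```

High precision means the network's guesses about a user are mostly right; high recall means the ads cover most of what the user actually cares about.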
For the demographic personalization, however, we first need to group the subjects into different demographic groups for each demographic category. As you can see, this is the distribution table for the demographic groups: we divide the subjects based on their demographic attributes into different groups, for example dividing people into two gender groups, female and male, and the table shows the distribution of subjects in each demographic category. After dividing them into those categories, we count the number of times each ad was delivered to users in the different demographic groups, and by using a statistical test for independence we can determine whether an ad is correlated with a given demographic category. For example, if we see an ad that is shown equally between female and male users, then we can say that this ad is not correlated with gender at all. Based on the results, not surprisingly, gender is the demographic most correlated with ad content, followed by parental status, which is advertised by Google as a targeting option anyway. But surprisingly, many other ads are dependent on the user's income level, which is not even mentioned by Google as part of the targeting options advertised on their website. We are not saying Google collects more information than it needs; it is possibly the result of correlation between income and the other demographic attributes, which causes the significant correlation between income and ads; for example, people with children may tend to have higher income levels. In this case we can conclude that ads are personalized based on some of the user demographics, but not all, and we are not sure whether Google actually collects more information than it has advertised. So after seeing all those strong correlations, we ask whether there is a privacy leak going on.
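The independence measurement can be sketched with a chi-square statistic over a contingency table of ad-delivery counts per group (chi-square is a standard choice for this kind of test; the talk only says "a statistical test for independence", so treating it as chi-square is an assumption, and the counts below are hypothetical):

```python
def chi_square_statistic(table):
    """Chi-square statistic for a contingency table of ad-delivery counts.

    table[g][0] = number of ad requests from group g where the ad WAS shown
    table[g][1] = number of ad requests from group g where it was NOT shown
    A large statistic means delivery depends on the group (correlation).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: the ad is shown far more often to group 0 (e.g. female)
table = [[40, 60],   # group 0: shown 40 times out of 100 requests
         [10, 90]]   # group 1: shown 10 times out of 100 requests
stat = chi_square_statistic(table)
# Critical value for df=1 at alpha=0.05 is about 3.841
print(stat > 3.841)  # True -> reject independence: the ad correlates with the group
```

An ad shown equally to both groups would give a statistic near zero, matching the "not correlated with gender at all" example above.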
So in particular we ask whether an adversary can predict a user's real personal information based on the ad content only. We are particularly interested in demographics, simply because we consider them more sensitive. To do that, we applied machine learning algorithms to build models that predict users' demographic information from the ad content, and evaluated the accuracy. You can imagine it as a simple game: when you see a subject, how likely are you, based on the distribution table we are seeing here, to guess this person's gender correctly? For a random guess it should be fifty percent, simply because there are only two gender groups that we divided, so it is a fifty-fifty situation. But what about an educated guess? Based on the distribution that we have, the educated guess should be 56.22 percent, because with a majority-selection strategy, in order to achieve the highest accuracy we should always guess that the subject belongs to the majority group. This is actually what we use as the base case: for the dummy classifier we use the random guess described on the previous slide, and for the augmented dummy we use the educated guess as the baseline accuracy. We then trained different machine learning models using the ad content and the demographic information from the users, and here are the classifiers and the accuracies across the different models. As we can see, gender, age, and parental status are the most predictable ones, not surprisingly, because they are advertised by Google as targeting options. But there are also some other attributes that we can predict, for example the income level, which is more sensitive than the above three; we can get a prediction accuracy higher than the
dummy or the augmented dummy baselines. Again, that could possibly be due to the strong correlation with the other three demographics, but in general we are seeing a privacy leakage there, and we can definitely say that an adversary who sees the ad content can guess other attributes of users with pretty high accuracy. What about countermeasures, then, since we are seeing this channel? Some people have proposed using HTTPS to encrypt the ad content, which is basically just the HTML source that we have seen, but we argue that it is not useful in this case. Encryption only applies to the communication channel between the app and the ad network; the ad content will still be plain text at the time it arrives in the app and is displayed to the user, so the app developers can still use their code to access the plain-text HTML source of the ad in order to learn who you are. Some other countermeasures have been proposed, for example isolating the ad-related code from the code of the hosting app, as well as separating their privilege levels, but these could be possible future work. We also realize there are a lot of limitations in this study. First of all, we have a relatively small sample size, simply because it is hard to recruit people across the Internet; after eliminating the outliers and biased results, we could only get about two hundred users in that period of time. And even if we increased the sample size, we argue that there could still be a largely uneven demographic distribution in the data set, which could also affect our findings. For future studies, we could examine more mobile ad networks, instead of focusing only on the Google ad network, to test whether there is any privacy leakage in those networks using the same methodology.
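The two baselines from the prediction experiment, the random (dummy) guess and the majority (augmented dummy) guess, can be sketched as follows; the label counts here are hypothetical, chosen only to reproduce the 56.22 percent majority share mentioned in the talk:

```python
from collections import Counter

def random_guess_accuracy(labels):
    """Expected accuracy of uniform random guessing over the classes
    (the 'dummy' baseline): 1 / number of distinct classes."""
    return 1.0 / len(set(labels))

def majority_guess_accuracy(labels):
    """Accuracy of always predicting the most frequent class
    (the 'augmented dummy' / educated-guess baseline)."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical gender labels with a 56.22% majority class
labels = ["male"] * 5622 + ["female"] * 4378
print(random_guess_accuracy(labels))    # 0.5
print(majority_guess_accuracy(labels))  # 0.5622
```

A trained classifier only demonstrates leakage if its accuracy clearly exceeds both of these baselines, which is exactly the comparison reported in the results table.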
In conclusion, we found a strong correlation between both the demographic and interest profiles and the ad content, and we also identified a lot of privacy concerns in the information collection procedure behind the personalization provided by the ad network. That's pretty much it. Thank you.

Yes. That's simply because the ad network can only encrypt the traffic between itself and the app. We consider the app developers as adversaries: when an ad is delivered to the application, it gets decoded at the endpoint, so they can actually use their application code to access the mobile advertisement's HTML source code, get the ad content, and then run the machine learning algorithms.

It could be a lot of things. For mobile devices, of course, when you are using an Android device you have to log in to your Google or Gmail account, which is already full of your information, and then there is your behavior on the device, for example which applications you install, how long you use each application, and so on. We don't really know exactly, but this could be the information the mobile device collects and transfers back to the ad network in order to deliver more personalized advertising.

Yes, only Google, or somebody from the ad network, could possibly know.

Yes, of course, but we assume that Google has the most capability of doing that, simply because it is the Android platform.

Yes, of course, it is available; I don't have it here, but I can email you later.

For the training size, I think we are using ten-fold cross-validation, so basically the training set is ninety percent and the testing set is ten percent, and then we feed the learning algorithms with
the ad content and also the demographic labels of the users. We feed that into the different machine learning models, and the classifiers output the accuracy of guessing those attributes. We then compare with the base cases we talked about, the dummy and the augmented dummy, in order to see whether we have a significant accuracy improvement over the base case.

Actually, the distribution is shown here: we categorized age into five groups, and the distribution is listed as well. We categorized each demographic attribute into different groups and then ran those machine learning models to get the accuracy.

So basically, in this study we are proposing a different approach from the previous study, which used crafted user profiles; instead, we use real user profiles. We are trying to use a new, more accurate approach to study these ad networks, and rather than conducting further research on other ad networks ourselves, we are proposing this methodology for others to use if they want to study more about mobile advertising.

No. As we said before, Google only has targeting options based on demographics and interests; for demographics they only make the claim that they use gender, age, and parental status as targeting options, and for interests it is those twenty-two interest categories used as targeting options. I don't recall them mentioning any privacy leakage or anything like that.

Well, not so far, because the paper just got accepted at the conference and we still need to present it there.

Yes, Google has its own ad network, so basically it is either Google or the ad network that can learn that information. Yes.
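The ten-fold evaluation just described (train on 90 percent, test on 10 percent, rotating the held-out fold) can be sketched as follows; the sample count of 200 is illustrative, roughly matching the study's user count:

```python
import random

def ten_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) index pairs for k-fold cross-validation:
    each fold holds out ~1/k of the data for testing (10% when k=10)
    and trains on the remaining samples (90%)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    fold_size = n_samples // k
    for f in range(k):
        test = idx[f * fold_size:(f + 1) * fold_size]
        train = idx[:f * fold_size] + idx[(f + 1) * fold_size:]
        yield train, test

folds = list(ten_fold_splits(200))
print(len(folds))        # 10 folds
print(len(folds[0][0]))  # 180 training samples per fold
print(len(folds[0][1]))  # 20 test samples per fold
```

The per-fold test accuracies are then averaged and compared against the dummy and augmented-dummy baselines to decide whether the classifier's improvement is significant.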
It is a bit hard, but in terms of the privacy leakage, as you can see there are a lot of correlations between those demographic attributes. Even if Google advertises that they only target those three attributes, there are in fact many correlations or intersections between the demographics, so we can't really eliminate the privacy leaks from the advertising standpoint. But at the endpoint, on the mobile platform, we can definitely do something to eliminate the leakage, for example separating the ad code and setting different privileges to disallow other parties from accessing the ad content you are seeing. So we can only act from the endpoint.

For the web case, ad blocking actually blocks ads from being displayed to the users. The web publishers cannot access the ad content in the first place, simply because it is isolated in an iframe, so they have no knowledge of it. We are less concerned about users seeing the ads themselves; we are more concerned about the app developers or web publishers seeing the ad content. So ad blocking is really user-facing; it is not really a countermeasure here.

Yes. Yes, maybe.

Yes, yes, so you could do something on that side. Yes, sure.

Can you connect that to real people? Yeah, so for example, based on the results we showed, we can somehow guess correctly about your gender, your age, and so on. And there could definitely be linkage attacks that link those databases together in order to guess which person is which; we are not quite sure whether this information is used for those kinds of linkage attacks.
Here we are just trying to argue that a lot of your personal information has been leaked. Even if they don't really know your name (although they could, because you have an account and may have put your name in it), they still know there is a person like this. Even machine learning can only get a vague profile of you, and it is not one hundred percent accurate; they are just trying to guess who you are and what kind of person you are in order to deliver related ads to you, so that you may click through the ads and they will make a profit.

Yes, that is actually the idea behind Google's personalization. We don't really know the specific algorithm they use to pinpoint a specific person, but imagine you have a mobile device: they can easily pinpoint your geolocation and also your movement, how far you travel each day, so they can possibly know where you live and where you work. With that kind of information, as well as the demographic information, they can probably pinpoint you to a group of people. But we have no knowledge of how they do the personalization behind the scenes.

Yes, you can turn off the GPS location service. But then the remaining information can still be reverse-engineered by the app developers; it is not Google that we are concerned about here.
Actually, for medical databases, they eliminate sensitive information by processing the databases, for example removing names and maybe even addresses or zip codes. But as I said, there are a lot of linkage attacks going on across databases: whenever you have two or more databases, an attacker can try to link them, guessing which entry in one database is correlated with which entry in another, so that they get a more accurate profile of you. Maybe one database doesn't have your name because it is sensitive, but the other database might have your name along with other related information, so they can match the entries and find out who is actually who in the medical database.

Yeah, it comes down to the purpose. We assume Google is a responsible company, and we assume it is OK for Google to collect our information through different channels, but most people just do not feel comfortable sharing that information with other third parties such as the app developers.

Yes, it's not encrypted at all at this point. As long as you can intercept the traffic, you can see the plain-text code. Thank you.