School of Computing
COMP3300 Assignment Project

Assignment marks: 18% of overall unit marks.
Due date: see iLearn page.

Objective: To gain experience in evaluating a dataset for privacy and utility so that the relevant privacy law is respected, and in selecting a privacy-preserving technique to protect against the identified vulnerabilities.

Please note: This assignment specification aims to provide as complete a description of this assessment task as possible. However, as with any specification, there will always be things we should have said that we have left out and areas in which we could have done a better job of explanation. As a result, you are strongly encouraged to ask any questions of clarification you might have, either by raising them during a lecture or by posting them on the iLearn discussion forum devoted to this assignment.

Transport data release

[Map of the Westeros Rail Network]

The Westeros Rail Network, illustrated in the map above, serves the Westeros region. In order to improve its services (e.g. to decide whether to upgrade station services), it collects information about journeys via the “tap on/tap off” information it logs every time a passenger goes through a barrier at a Westeros Rail Network station. There are, of course, a multitude of journeys taking place every day, but because an individual journey is often unique, journeys could still be used to identify particular passengers.

To serve the needs of transparency (so that the station upgrade can be explained to the public), the Westeros Rail Network is planning to release a “de-identified” dataset. The Westeros Rail Company would like to know whether their proposed de-identified dataset satisfies the Westeros Privacy Law:

“No individual can be re-identified in any publicly-released dataset.
Re-identified means that a record can be linked with very high likelihood to an individual using information which could be plausibly obtained.”

It has hired a team of privacy experts to determine whether or not there are still vulnerabilities in the dataset that could be construed as a breach. If any vulnerabilities are discovered, the team has been asked to recommend how the data should be changed by applying an appropriate privacy technique, while ensuring that the core utility of the dataset is not destroyed.

The Valar Morghulis Travel Card

The Valar Morghulis Card (or VM Card) is what passengers use to go through the electronic barriers at a station by tapping on or tapping off. Each time they do so, information about the journey and the card is collected by the Westeros Rail Network.

Data collected about the card

When a card is issued, it is assigned a Card Id and a card type. The Card Id is a four-digit number and is unique to the card. It is used by the Westeros Rail Company in its internal accounting system by, for example, linking it to a customer’s credit card. This also makes it easy to issue a customer with a new card in case of loss: they simply create a new card with a new Card Id and link it to the customer’s internal record. The separation of the Card Id and the customer record is how the Rail Company hopes to abide by the Westeros Privacy Law.

Each card also has a type, such as Adult or Child. The complete set of card types is shown in the table on the left.

Each individual who has a card is actually quasi-identified by their Card Id. However, individuals do not know what their Card Id is, so the Westeros Rail Company assumes that the separation between the real passenger and the Card Id is sufficient to de-identify the data in the transport dataset (described below).
Data collected about the journey

Whenever a passenger taps on or taps off, the station and the time and day of the tap are logged, together with the card type and the Card Id. This is communicated to the Westeros Transport Dataset to create a dataset consisting of records, where each record is a complete journey including the tap on time and station and the tap off time and station.

The Westeros Transport Dataset

Below is an example of a few records in the transport dataset that the Westeros Rail Company is planning on releasing:

CardID, CardType, Touch On: Day, Time, Station, Touch off: Day, Time, Station
7378, #6, Thursday, 1121, Stadium, Thursday, 1131, Braavos
7378, #6, Wednesday, 1843, University, Wednesday, 1902, Braavos
6150, #3, Wednesday, 227, Police, Wednesday, 235, Braavos
6150, #3, Wednesday, 1856, Casterly Rock Shopping, Wednesday, 1918, King’s Landing
3090, #1, Tuesday, 1411, Law Courts, Tuesday, 1455, Braavos

This shows five journeys of three different passengers (identified by the CardID). The first record, for example, shows that a passenger with CardID 7378 tapped on at 11:21am on Thursday at the Stadium Station, and then tapped off at 11:31am at Braavos Station. From the card type (#6) we know that this passenger is travelling under a Senior Card.

The first task of the privacy experts is to determine whether the records can be re-identified. (See below for details.)

A Public Dataset: Twitter

The privacy experts have pointed out to the Westeros Rail Network that they must check whether any record in the Westeros Transport Dataset can be linked to individuals in a public dataset. They have proposed using Twitter posts to do this, as they are easily available and include posting times as a potential attribute that can be linked. An example of a Twitter post is shown below:

[email protected]: Just got to the Business District with interviewing for the new season of The Apprentice Westeros!! Looking forward to hiring and firing. Posted 15:43, Friday.
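The dataset's Time fields (for example 1121 for 11:21am) and a post's timestamp (for example 15:43) are written differently, so comparing them first requires a common scale. The following is a minimal Python sketch of one way to do that, assuming the last two digits of a dataset time are the minutes and any preceding digits are the hour. The function names are hypothetical and not part of the assignment materials.

```python
def decode_time(t: int) -> tuple[int, int]:
    """Split a dataset time such as 1121 into (hour, minute)."""
    hour, minute = divmod(t, 100)
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"not a valid dataset time: {t}")
    return hour, minute


def minutes_since_midnight(t: int) -> int:
    """Convert a dataset time to minutes after midnight, so that
    differences between two times are true durations in minutes."""
    hour, minute = decode_time(t)
    return hour * 60 + minute


def clock_to_dataset_time(clock: str) -> int:
    """Convert an 'HH:MM' string (as in 'Posted 15:43') to the
    dataset's integer form (1543)."""
    hour, minute = clock.split(":")
    return int(hour) * 100 + int(minute)


# Duration of the second example journey (tap on 1843, tap off 1902).
duration = minutes_since_midnight(1902) - minutes_since_midnight(1843)
print(duration)
```

Subtracting the raw integers directly (1902 - 1843) would give 59 rather than the true 19 minutes, which is why the sketch converts to minutes first.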
The Assignment

In this assignment the privacy experts first carry out an assessment of the privacy vulnerabilities of the Westeros Transport Dataset, which can be downloaded from iLearn in a file called WesterosTravel.csv. It consists of journeys logged during a single week, and each record is as described above. Notice, however, that the times appear simply as a number between 0 and 2359, to make it easier for you to use filtering in your analysis. This means that midnight is represented by 0, and 5 minutes past midnight is simply 5. For all other times of 2 or more digits, the last two digits represent the minutes after the hour and any digits preceding those represent the hour. So 946 is 09:46 and 1327 is 13:27.

Task 1: Role Description

In this task, the privacy experts have asked for volunteers from the public to help them. Each volunteer travels regularly on the train. They also often travel with a friend or colleague. In Task 1, you will take on the role of the volunteer. To find out who you are, click on the Character Specification under the Assignment tab in iLearn. The specification contains a short description of your journeys. Your journey description will also include some information about a friend or colleague who sometimes shares all or part of your journey. You are also given the identity of a famous individual who has a Twitter feed, together with a public post from that person.

Task 1 (a): Re-identify your journeys (10 marks)

For this task, use what you know about your journeys described in Task 1 to identify your Card Id.

Task 1 (b): Re-identify your friend’s or colleague’s journeys (10 marks)

For this task, use what you know about your friend or colleague’s journeys to identify their Card Id.

Task 1 (c): Learn a new fact about your friend or colleague (10 marks)

For this task, identify an additional destination travelled to by your friend or colleague that is not included in the description in Task 1. i.e.
It should be about a different journey that is not part of their regular routine.

Task 1 (d): Perform a linkage attack with Twitter (10 marks)

For this task, link the information in your given Twitter post from Task 1 to some journeys in the dataset to identify the Card Id of the Twitter poster.

Task 1 (e): Summarise your findings (15 marks)

Summarise the vulnerabilities that you have discovered by answering the following:
(a) Describe what you did to re-identify yourself and your friend/colleague. (3 marks)
(b) Describe what you did to find out a new fact about your friend or colleague. (3 marks)
(c) Describe what you did to re-identify the famous person. (3 marks)
(d) Explain whether these attacks constitute a breach of the Westeros Privacy Law. (6 marks)

Your summaries for each part (a)-(d) above should be 1-2 sentences each, and contain sufficient information to show your understanding of the re-identification attacks you performed.

Task 2 (15 marks)

After the vulnerabilities discovered by the volunteers in Task 1 have been reported, the Westeros Rail Network tells the privacy experts that the most useful part of the data is the number of tap on/tap off times within a time period at any given station. This is their stated utility.

(a) Select which attributes of the transport dataset are not needed for this utility to be still achievable. (5 marks)
(b) If the attributes you identified were to be removed, explain whether the vulnerabilities identified in Task 1 are still present. (5 marks)
(c) What would you (now in the role of the privacy expert) report to the Westeros Rail Company about whether the modification suggested at part (b) above preserves their stated utility? What would you report regarding the modification’s ability to abide by the Westeros Privacy Law? Briefly state and explain your overall recommendation regarding this modification. (5 marks)
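The stated utility in Task 2 can be read as a count of taps at a station within a time window. The sketch below shows one possible way to compute such a count in Python, using the five example records from this specification. The tuple layout and the function name count_taps are assumptions made for illustration only, and the column headings of the real WesterosTravel.csv file may differ.

```python
# Example records copied from this specification, laid out as
# (CardID, CardType, on_day, on_time, on_station, off_day, off_time, off_station).
# This tuple layout is an assumption for illustration.
JOURNEYS = [
    (7378, "#6", "Thursday", 1121, "Stadium", "Thursday", 1131, "Braavos"),
    (7378, "#6", "Wednesday", 1843, "University", "Wednesday", 1902, "Braavos"),
    (6150, "#3", "Wednesday", 227, "Police", "Wednesday", 235, "Braavos"),
    (6150, "#3", "Wednesday", 1856, "Casterly Rock Shopping",
     "Wednesday", 1918, "King's Landing"),
    (3090, "#1", "Tuesday", 1411, "Law Courts", "Tuesday", 1455, "Braavos"),
]


def count_taps(journeys, station, day, start, end, tap="on"):
    """Count taps at `station` on `day` with start <= time < end.

    Times are the dataset's integers (e.g. 1121). Comparing them directly
    is safe because this encoding preserves chronological order in a day.
    """
    if tap not in ("on", "off"):
        raise ValueError("tap must be 'on' or 'off'")
    offset = 2 if tap == "on" else 5  # index of the relevant day field
    count = 0
    for record in journeys:
        rec_day, rec_time, rec_station = record[offset:offset + 3]
        if rec_station == station and rec_day == day and start <= rec_time < end:
            count += 1
    return count


# Tap-offs at Braavos at any time on Wednesday in the example records.
print(count_taps(JOURNEYS, "Braavos", "Wednesday", 0, 2400, tap="off"))
```

The window is half-open (the start time is included, the end time is not), so consecutive windows never count the same tap twice. Whether the boundaries should be inclusive in your own analysis is a choice you should state explicitly.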
Task 3 (15 marks)

(a) Choose the most appropriate privacy-preserving technique from the choices (i) and (ii) below. Your technique must preserve the following utility: it must enable the number of tap on/tap offs within each 10-minute period to be preserved approximately.
(i) Generalise the tap on/tap off timings to intervals of 5 minutes by rounding to the nearest 5-minute period. So a time of 09:12 would be reported as 09:10 and a time of 09:14 would be reported as 09:15.
(ii) Suppress the hour in the tap on/tap off timings, i.e. only preserve the minutes, e.g. turn 09:45 into 45, and turn 18:45 into 45.
Indicate in your answer on iLearn which method you have chosen. (1 mark)

For the choice you made above, apply the chosen technique to the dataset and then answer the following:

(b) In the dataset you have just created:
1. At station Stadium, what is the average number of tap-ons between 4pm and 6pm on weekdays? (i.e. count the total number of tap-ons at Stadium between 4pm and 6pm on Monday to Friday and divide by 5.) (3 marks)
2. At station Braavos, what is the average number of tap-ons between 9am and 11am on weekdays? (3 marks)
3. At station Winterfell, what is the average number of tap-ons between 10am and 2pm on weekdays? (3 marks)

(c) Referring to the Westeros Privacy Law, and by comparing your answers from part (b) to the corresponding results on the original dataset, would you recommend the approach ((i) or (ii)) you took to balance the trade-off between privacy and utility? Explain your answer. (5 marks)

Task 4 (15 marks)

An alternative approach is to apply bounded noise with B=5 (selected uniformly) to the tap on/tap off times. For example, if the time is 09:45 then the possible outputs are 09:40, 09:41, … 09:50, each with equal probability.

(a) Assume that this perturbation technique will be applied to the dataset. For the first two scenarios described at Task 3 (b) (i.e. 1 and 2), calculate the probability of incorrectly reporting the average count.
(Hint: How many people may be incorrectly excluded/included in the count? What is the probability of excluding/including these people?) (8 marks)

(b) By referring to the corresponding scenarios in Task 3 or otherwise, what method of anonymisation would you recommend so that the Westeros Privacy Law is respected? Explain your answer. (7 marks)

The Assessment

This assignment is structured to allow you to decide how much effort you want to expend for the return in marks that you might hope for. You can choose which parts of the privacy assessment to complete; however, note that a later Task cannot be completed unless you have completed the Tasks that precede it. You can therefore decide upfront whether you are shooting for a pass or a high distinction, and know exactly how much work will be required to obtain that mark. Here is what is required to obtain marks in one of the performance bands for this assignment:

Pass: completion of Task 1.
Credit: the P level implementation + completion of Task 2.
Distinction: the Cr level implementation + completion of Task 3.
High Distinction: the D level implementation + completion of Task 4.

What you have to do

Download the dataset WesterosTravel.csv from iLearn, and view the Character Specification to find out your identity, your partner and your famous Twitter user, together with information about their journeys. Open the file WesterosTravel.csv using a spreadsheet program such as Excel (or any program of your choosing) and use its filtering function (or other means) to help you find privacy vulnerabilities in the dataset.

What you must hand in

In the submission page on iLearn for this assignment you must input the answers from your analysis above by the due date. This is in the form of an iLearn Quiz. Some of these answers will be automatically marked, so it is important to abide by the input instructions. You have exactly two attempts to submit the quiz, so please only submit when you are happy with your answers.
Late penalty

It is important that you hand your work in on time. Please note that a late penalty of 10% per day will be applied unless a valid Special Consideration is lodged.