On the difficulty of achieving Differential Privacy in practice: user-level guarantees in aggregated location data


Contribution of each user to the dataset

The guarantees given by Bassolas et al. rest on the assumption that each user contributes no more than one trip to the dataset. Analyses in the paper itself strongly suggest that users contribute more than one trip to the dataset.

Indeed, the final dataset used by Bassolas et al. contains connections, i.e. origin-destination trip counts, each with at least 100 trips. The dataset contains, for example, 46,333 connections for Atlanta (population 5 M). Assuming each user reports exactly one trip, at least 4.6 million people in Atlanta would have had to contribute to the dataset to reach the number of connections it reports. Since only 67% of mobile phone users in the United States use Google Maps as their primary navigation application (ref. 7), we find this unlikely, which strongly suggests that some users contributed more than one trip to the dataset.
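The arithmetic behind this argument can be checked directly (all numbers are taken from the text above):

```python
connections = 46_333            # published connections for Atlanta
min_trips_per_connection = 100  # each connection has at least 100 trips
population = 5_000_000
gmaps_share = 0.67              # share using Google Maps as primary app (ref. 7)

min_trips = connections * min_trips_per_connection   # lower bound on trips
max_one_trip_users = int(population * gmaps_share)   # plausible contributors

print(min_trips)           # 4633300 trips at minimum
print(max_one_trip_users)  # 3350000 potential one-trip contributors
# The required number of trips exceeds the plausible pool of one-trip users,
# so the one-trip-per-user assumption cannot hold.
assert min_trips > max_one_trip_users
```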

Regarding unique trips, the authors later confirmed that each user contributes the list of their unique weekly trips to the weekly aggregate. If the same trip (A → B) is made several times in a week, it is counted only once. For one of the authors, who made 39 trips in a given week, this means that he contributed 32 unique trips to the weekly aggregate while the other 7 were discarded. Note that here unique refers to trips that are unique for a given user during a given week.
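A minimal illustration of this weekly deduplication (the trip list is made up; only distinct origin-destination pairs reach the aggregate):

```python
# Hypothetical trips for one user in one week, as (origin, destination) pairs.
week_trips = [("A", "B"), ("B", "A"), ("A", "B"), ("A", "C"), ("C", "A"), ("A", "B")]

# Repeated (A, B) trips within the week are counted once.
unique_trips = set(week_trips)

print(len(week_trips))    # 6 trips made
print(len(unique_trips))  # 4 unique trips contributed to the weekly aggregate
```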

Generating trips from empirical data

We use a longitudinal mobility dataset extracted from CDR data. Each individual trajectory contains points with a time and an approximate location (antennas). We segment the trajectories using a time-window approach, selecting the most used location for each hour, and define a trip as a movement from one location to another between consecutive hours.
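This segmentation can be sketched as follows; the point format `(hour, antenna)` and the antenna names are hypothetical, chosen only to illustrate the per-hour majority rule and the consecutive-hour trip definition:

```python
from collections import Counter

def hourly_locations(points):
    """points: list of (hour, antenna_id) CDR observations.
    Pick the most used antenna for each hour (time-window segmentation)."""
    by_hour = {}
    for hour, loc in points:
        by_hour.setdefault(hour, []).append(loc)
    return {h: Counter(locs).most_common(1)[0][0] for h, locs in by_hour.items()}

def extract_trips(points):
    """A trip is a change of location between consecutive hours."""
    locs = hourly_locations(points)
    hours = sorted(locs)
    return [(locs[a], locs[b]) for a, b in zip(hours, hours[1:])
            if b == a + 1 and locs[a] != locs[b]]

# Hypothetical CDR points for one user:
pts = [(8, "home"), (8, "home"), (9, "work"), (10, "work"), (11, "gym")]
print(extract_trips(pts))  # [('home', 'work'), ('work', 'gym')]
```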

Executing the attack

We follow the procedure described by Bassolas et al. to produce the anonymized aggregates: compute the origin-destination matrix of unique-trip counts, add zero-mean Laplace noise with scale 1/ε to each entry, and remove all (noisy) counts below 100.
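A sketch of this aggregation mechanism, assuming the counts are held in a NumPy matrix; ε = 0.66 is the per-count value consistent with the totals reported later, and the toy matrix is illustrative:

```python
import numpy as np

def anonymize(od_counts, epsilon, threshold=100, rng=None):
    """Per-entry Laplace(1/epsilon) noise on the unique-trip
    origin-destination counts, then suppression of small noisy counts."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = od_counts + rng.laplace(0.0, 1.0 / epsilon, size=od_counts.shape)
    noisy[noisy < threshold] = 0.0   # (noisy) counts below 100 are removed
    return noisy

rng = np.random.default_rng(0)
counts = np.array([[120.0, 40.0], [300.0, 99.0]])  # toy OD count matrix
released = anonymize(counts, epsilon=0.66, rng=rng)
# Every released entry is either suppressed (0) or at least the threshold.
print(np.all((released == 0) | (released >= 100)))
```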

We then use the attack model the authors rely on to compute the 16% improvement over a random guess: the standard membership inference attack with perfect knowledge. In this model, the powerful attacker has access to all records in the dataset except the victim's, along with auxiliary information about the victim.

More precisely, for k between 0 and 70, we select a user u with exactly k trips. The attacker performs a membership attack to test whether the anonymized data D* they received is D+ (anonymized trajectories with u included) or D− (without u included). We compute the local origin-destination matrix A(u) for user u and, by linearity of the noise addition, compute the normalized matrix A(D*) − A(D−), generated either without or with u. We then perform a likelihood-ratio test to distinguish whether the normalized matrix was sampled from a Laplace distribution L(0, 1/ε) or L(A(u), 1/ε).
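A minimal sketch of this likelihood-ratio test. The matrix size, the shape of A(u), and ε = 0.66 (the per-count value implied by the totals reported later) are illustrative assumptions, not the authors' exact setup:

```python
import numpy as np

def log_lik_laplace(x, mu, scale):
    # Log-likelihood of independent Laplace(mu, scale) observations.
    return np.sum(-np.abs(x - mu) / scale - np.log(2 * scale))

def membership_guess(normalized, A_u, epsilon):
    """Likelihood-ratio test: decide whether the normalized matrix
    A(D*) - A(D-) was drawn from L(A(u), 1/eps) (u present, return True)
    or L(0, 1/eps) (u absent, return False)."""
    scale = 1.0 / epsilon
    return log_lik_laplace(normalized, A_u, scale) > log_lik_laplace(normalized, 0.0, scale)

# Toy instance: a victim u contributing k unique trips (k cells set to 1).
rng = np.random.default_rng(1)
eps, k = 0.66, 10
A_u = np.zeros(100)
A_u[:k] = 1.0
observed = A_u + rng.laplace(0.0, 1.0 / eps, size=100)  # u was included
print(membership_guess(observed, A_u, eps))  # usually True when u is included
```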

We repeat this procedure 10,000 times for every value of k between 0 and 70 and report the average in Fig. 1.

Theoretical bounds

The theoretical bound reported by Bassolas et al. is obtained by bounding the posterior probability π(y) of an attacker trying to infer whether a user is in the dataset. Formally, let D* be the tested dataset, D+ the dataset with user u, and D− the dataset without u. If the attacker's prior is uninformative (i.e., P[D* = D+] = 0.5), we then have for all y (and for M an ε-DP mechanism (ref. 8)):

$$ \frac{\pi(y)}{1-\pi(y)} = \frac{P[{D}^{\ast}={D}^{+} \mid M({D}^{\ast})=y]}{P[{D}^{\ast}={D}^{-} \mid M({D}^{\ast})=y]} = \frac{P[M({D}^{+})=y]}{P[M({D}^{-})=y]} \le {e}^{\varepsilon} $$

which then implies π(y) ≤ e^ε / (1 + e^ε).
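This bound can be checked numerically for the Laplace mechanism on a single count (sensitivity 1), with ε = 0.66 assumed as the per-count budget:

```python
import numpy as np

eps = 0.66
scale = 1.0 / eps
ys = np.linspace(-10, 10, 2001)

# Densities of M(D+) (true count 1) and M(D-) (true count 0) under Laplace noise:
p_plus = np.exp(-np.abs(ys - 1) / scale) / (2 * scale)
p_minus = np.exp(-np.abs(ys) / scale) / (2 * scale)

# Posterior P[D* = D+ | M(D*) = y] under a uniform prior:
posterior = p_plus / (p_plus + p_minus)
bound = np.exp(eps) / (1 + np.exp(eps))

# The posterior never exceeds e^eps / (1 + e^eps), as derived above.
assert posterior.max() <= bound + 1e-12
print(round(bound, 3))  # 0.659
```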

Conservative estimate of the privacy loss

To estimate the privacy loss for any user in one week of data, we assume conservative bounds: each user contributes at most once to each count and makes no more than 70 unique trips per week (10 per day). Let m_trips be the maximum number of unique trips any user could contribute to the data; the L1 sensitivity of the count matrix is then m_trips. Adding Laplace(1/ε) noise and filtering out small counts yields (m_trips × ε, 2.1 × 10⁻²⁹)-differential privacy by direct application of simple composition bounds (ref. 9).

Likewise, the privacy loss for a year of data can be estimated as the sum of the privacy losses for each week. A reasonable estimate of the total loss for the data release is thus 52 times the one-week privacy loss, ε_total = 52 × m_trips × ε = 2402.4.
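The composition arithmetic, using ε = 0.66 (the per-count value that reproduces the reported total):

```python
eps = 0.66       # per-count Laplace budget
m_trips = 70     # conservative cap: 10 unique trips per day, 7 days
delta = 2.1e-29  # delta reported for the release

# Simple composition: L1 sensitivity m_trips gives (m_trips * eps, delta)-DP
# per week, and 52 weeks add up linearly.
eps_week = m_trips * eps
eps_total = 52 * eps_week

print(round(eps_week, 2))   # 46.2 per week
print(round(eps_total, 1))  # 2402.4 for a year
```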

Note that although tighter bounds can be obtained, they require larger values of δ (refs. 8, 9). In this specific case, acceptable values of δ would require prohibitive values of ε, rendering the guarantees meaningless in practice.

