Releasing a Differentially Private Password Frequency Corpus from 70 Million Yahoo! Passwords

Uploaded On 2023-06-26

Jeremiah Blocki, Purdue University. DIMACS/Northeast Big Data Hub Workshop on Overcoming Barriers to Data Sharing including Privacy and Fairness.

Presentation Transcript

1. Releasing a Differentially Private Password Frequency Corpus from 70 Million Yahoo! Passwords
Jeremiah Blocki, Purdue University
DIMACS/Northeast Big Data Hub Workshop on Overcoming Barriers to Data Sharing including Privacy and Fairness

2. What is a Password Frequency List?
Password dataset (N users): password, 12345, abc123, abc123
Histogram: abc123: 2, 12345: 1, password: 1

3. What is a Password Frequency List?
Frequency list: the histogram counts in sorted order, with the passwords themselves dropped.
Most frequent: 2
2nd most frequent: 1
3rd most frequent: 1

4. What is a Password Frequency List?
Formally: a password frequency list is just an integer partition of N.
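
A minimal sketch of how the frequency list on slides 2 to 4 can be derived from a password dataset. The function name `frequency_list` is hypothetical and chosen only for illustration.

```python
from collections import Counter

def frequency_list(passwords):
    """Return the password counts sorted from most to least frequent.

    The passwords themselves are dropped; only the counts remain.
    """
    histogram = Counter(passwords)
    return sorted(histogram.values(), reverse=True)

# The four-user dataset from the slides.
dataset = ["password", "12345", "abc123", "abc123"]
f = frequency_list(dataset)
print(f)  # [2, 1, 1]

# The counts sum to the number of users, so f is an integer partition of N.
assert sum(f) == len(dataset)
```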

5. Password Frequency List (Example Use)
Estimate the number of accounts compromised by an attacker with a given number of guesses per user.
Online attacker: few guesses per user.
Offline attacker: many guesses per user.
Password frequency lists allow us to estimate the Marginal Guessing Cost (MGC) and the Marginal Benefit (MB).
Rational adversary: MGC = MB
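
As a rough illustration of the estimate on slide 5, the fraction of accounts an attacker compromises with a fixed number of guesses per user is the sum of the largest entries of the frequency list divided by N. The function name and the toy data below are assumptions for this sketch, not part of the talk.

```python
def fraction_compromised(freq_list, guesses_per_user):
    """Fraction of accounts cracked by trying the most common passwords first."""
    n = sum(freq_list)
    top = sorted(freq_list, reverse=True)[:guesses_per_user]
    return sum(top) / n

toy = [2, 1, 1]  # frequency list of the four-user example
print(fraction_compromised(toy, 1))  # 0.5  (online attacker, one guess)
print(fraction_compromised(toy, 3))  # 1.0  (offline attacker, many guesses)
```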

6. Available Password Frequency Lists (2015)
Site | #User Accounts (N) | How Released
RockYou | 32.6 Million | Data Breach*
LinkedIn | 6 | Data Breach*
… | … | …
Yahoo! [B12] | 70 Million | With Permission**
* Entire frequency list available due to improper password storage.
** Frequency list perturbed slightly to preserve differential privacy.
Yahoo! frequency data is now available online at: https://figshare.com/articles/Yahoo_Password_Frequency_Corpus/2057937

7. Yahoo! Password Frequency List
Collected by Joseph Bonneau in 2011 (with permission from Yahoo!).
Store H(s|pwd), where s is a secret salt value (the same for all users) that was discarded after data collection.
About 70 million Yahoo! users.
Yahoo! Legal gave permission to publish analysis of the frequency list.
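
The collection scheme on slide 7 can be sketched as follows. The slide only says H(s|pwd); the choice of SHA-256 for H, the 32-byte salt, and the function name are assumptions made for this illustration and are not claimed to match the original data collection.

```python
import hashlib
import secrets
from collections import Counter

def collect_frequency_list(passwords):
    """Count passwords via salted hashes, then throw the salt away."""
    salt = secrets.token_bytes(32)  # secret salt s, the same for all users
    counts = Counter(
        hashlib.sha256(salt + pwd.encode("utf-8")).digest()  # H(s | pwd), SHA-256 assumed
        for pwd in passwords
    )
    del salt  # the salt is not kept once counting is done
    # Only the counts are returned, never the hashes or the passwords.
    return sorted(counts.values(), reverse=True)

print(collect_frequency_list(["password", "12345", "abc123", "abc123"]))  # [2, 1, 1]
```

Because equal passwords hash to the same value under a shared salt, the counts are preserved, while discarding the salt makes the stored hashes hard to test against guessed passwords afterwards.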

8. Project Origin
"Would it be possible to access the Yahoo! data? I am working on a cool new research project and the password frequency data would be very useful."

9. Project Origin
"I would love to make the data public, but Yahoo! Legal has concerns about security and privacy. They won't let me release it."

12. Yahoo! Frequency Corpus
Largest publicly available frequency corpus (that was not the result of a data breach).
Source: Dark Web

13. Why not just publish the original frequency lists?
Heuristic approaches to data privacy often break down when the adversary has background knowledge:
Netflix Prize dataset [NS08]. Background knowledge: IMDB.
Massachusetts Group Insurance medical encounter database [SS98]. Background knowledge: voter registration records.
Many other attacks [BDK07, …].
In the absence of provable privacy guarantees, Yahoo! was understandably reluctant to release these password frequency lists.

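14. Security Risks (Example)
Figure: a four-user dataset (???, 12345, abc123, abc123). The adversary's background knowledge covers every password except the one marked ???.

15. Security Risks (Example)
Figure: the unknown password could be abc123, 12345, or something else. Each case yields a different exact frequency list, (3, 1), (2, 2), or (2, 1, 1), so publishing the exact list tells this adversary which case holds.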

16. Differential Privacy (Dwork et al.)
Definition: A (randomized) algorithm A preserves $(\epsilon, \delta)$-differential privacy if for any subset S of possible outcomes and for any pair of adjacent password frequency lists f and f′ we have
$$\Pr[A(f) \in S] \le e^{\epsilon} \Pr[A(f') \in S] + \delta.$$

17. Differential Privacy (Dwork et al.)
f: original password frequency list.
f′: the list obtained by removing Alice's password from the dataset.

18. Differential Privacy (Dwork et al.)
$\epsilon$: a small constant.

19. Differential Privacy (Dwork et al.)
$\delta$: a negligibly small value.

20. Differential Privacy (Example)
Figure: f minus Alice's password equals f′. S is the subset of all outcomes that are potentially harmful to Alice.

22. Differential Privacy (Example)
Intuition: Alice won't be harmed because her password was included in the dataset.
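
To make the definition on slides 16 to 19 concrete, the sketch below checks the inequality Pr[A(f) in S] <= e^eps * Pr[A(f') in S] + delta for two explicit output distributions over a finite outcome set. It relies on the fact that the worst-case subset S consists of exactly those outcomes where p exceeds e^eps * q. The toy distributions and function names are assumptions for illustration only.

```python
import math

def smallest_delta(p, q, eps):
    """Smallest delta such that P(S) <= e^eps * Q(S) + delta for every subset S.

    p and q map outcomes to probabilities. The worst subset S collects every
    outcome where p exceeds e^eps * q, so the answer is a sum of positive parts.
    """
    outcomes = set(p) | set(q)
    return sum(
        max(0.0, p.get(o, 0.0) - math.exp(eps) * q.get(o, 0.0)) for o in outcomes
    )

def is_dp_pair(p, q, eps, delta):
    """Check the (eps, delta) inequality in both directions for one adjacent pair."""
    return smallest_delta(p, q, eps) <= delta and smallest_delta(q, p, eps) <= delta

# Hypothetical output distributions of a mechanism on f and on f'.
p = {"a": 0.6, "b": 0.4}
q = {"a": 0.5, "b": 0.5}
print(is_dp_pair(p, q, eps=0.25, delta=0.0))  # True
print(is_dp_pair(p, q, eps=0.10, delta=0.0))  # False
```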

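23. Main Technical Result
Theorem: There is a computationally efficient algorithm A such that A preserves $(\epsilon, \delta)$-differential privacy and, except with small probability [value not preserved], A(f) outputs a frequency list whose distance from f is bounded [bound not preserved].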

24. Main Tool: Exponential Mechanism [MT07]
Input: f
Output: [formula not preserved]
Assigns very small probability to inaccurate outcomes.

25. Main Tool: Exponential Mechanism [MT07]
Theorem [MT07]: The exponential mechanism preserves $\epsilon$-differential privacy.
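
A small sketch of the general exponential mechanism of [MT07], restricted to a finite list of candidate frequency lists so that it can be run directly. The slide's own output formula is not preserved, so the weight exp(-eps * dist / 2) with L1 distance and sensitivity 1 is the textbook form and an assumption here, as are the function names and the candidate set.

```python
import math
import random
from itertools import zip_longest

def l1_distance(f, g):
    """L1 distance between two frequency lists, padding the shorter with zeros."""
    return sum(abs(a - b) for a, b in zip_longest(f, g, fillvalue=0))

def exponential_mechanism_probs(f, candidates, eps):
    """Probability of each candidate, proportional to exp(-eps * dist / 2)."""
    weights = [math.exp(-eps * l1_distance(f, c) / 2.0) for c in candidates]
    total = sum(weights)
    return [w / total for w in weights]

def sample(f, candidates, eps, rng=None):
    """Draw one candidate; defaults to OS-backed randomness."""
    rng = rng or random.SystemRandom()
    probs = exponential_mechanism_probs(f, candidates, eps)
    return rng.choices(candidates, weights=probs, k=1)[0]

f = (2, 1, 1)
candidates = [(2, 1, 1), (2, 2), (3, 1), (1, 1, 1, 1), (4,)]
for c, pr in zip(candidates, exponential_mechanism_probs(f, candidates, eps=1.0)):
    print(c, round(pr, 3))
```

Candidates far from f receive exponentially smaller weight, which is the sense in which the mechanism assigns very small probability to inaccurate outcomes.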

26. Analysis: Exponential Mechanism
Input: f
Output: [formula not preserved]
Theorem [HR18]: The number of partitions of the integer N is $e^{O(\sqrt{N})}$; asymptotically it is $\frac{1}{4N\sqrt{3}} e^{\pi\sqrt{2N/3}}$.
Assigns very small probability to inaccurate outcomes.

27. Analysis: Exponential Mechanism
Union bound: [bound not preserved] with high probability when [condition not preserved].

28. Analysis: Exponential Mechanism
Theorem: [bound not preserved] with high probability.
Theorem [MT07]: The exponential mechanism preserves $\epsilon$-differential privacy.

30. The Challenge: Efficiency
Strong evidence: sampling from the exponential mechanism is computationally intractable in general (e.g., [U13]).
Naïve implementation: exponential time (the distribution assigns weights to infinitely many integer partitions).

31. But, we did run the exponential mechanism
Theorem: There is an efficient algorithm A to sample from a distribution that is $\delta$-close to the exponential mechanism over integer partitions. The algorithm's time and space bounds are [not preserved].
Key intuition: [formula not preserved] suggests potential recurrence relationships.

32. But, we did run the exponential mechanism
Key Idea 1: A novel dynamic programming algorithm to compute weights $W_{i,k}$ such that [formula not preserved].

33. But, we did run the exponential mechanism
Key Idea 2: Allow A to ignore a partition if its distance from f [formula not preserved] is very large.

34. Practical Challenge #1
Space is the limiting factor: N = 70 million.
Workaround: an initial pruning phase identifies the relevant subset of the DP table for sampling.
Running time: 12 hours on this laptop.

35. Practical Challenge #2
The weights can get very large (too big for native floating point types in C#).
Workaround: store the logarithms of the weights instead of the weights themselves.
Important implementation question: where do your random bits come from? The default random number generator is much easier for a developer to use.
Example: Rand.NextDouble() vs CryptoRand.NextBytes()
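
The two points on slide 35 can be illustrated together. The sketch below keeps weights in log space so that values far beyond the floating point range never have to be materialised, and draws its randomness from the operating system rather than a default pseudo-random generator. The original implementation was in C#; this Python version, including the function names, is an assumption for illustration.

```python
import math
import random

def log_sum_exp(log_weights):
    """Compute log(sum(exp(lw))) without ever exponentiating a huge number."""
    m = max(log_weights)
    return m + math.log(sum(math.exp(lw - m) for lw in log_weights))

def sample_index(log_weights, rng=None):
    """Sample index i with probability proportional to exp(log_weights[i])."""
    rng = rng or random.SystemRandom()  # OS-backed randomness, not the default PRNG
    log_total = log_sum_exp(log_weights)
    u = rng.random()
    cumulative = 0.0
    for i, lw in enumerate(log_weights):
        cumulative += math.exp(lw - log_total)  # normalised probability, always <= 1
        if u < cumulative:
            return i
    return len(log_weights) - 1  # guard against rounding in the last step

# exp(1000) overflows a double, but the log-space computation is fine.
log_w = [1000.0, 1000.0 + math.log(3.0)]
print(log_sum_exp(log_w))
print(sample_index(log_w))
```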

36. Practical Challenge #3
"Does Yahoo! have any preference about the privacy parameter $\epsilon$?"

37. Practical Challenge #3
"Are there standardized guidelines to select $\epsilon$?"

38. Practical Challenge #3
"No, I was thinking [value not preserved] would be reasonable…"

39. Practical Challenge #3
"Yahoo! is fine with [value not preserved]."
Risk: industry deployments become the de facto standard for selecting $\epsilon$.
Suggested dinner discussion topic: What role should academia play in influencing these standards?

40. Yahoo! Results
Each row gives N followed by three summary metrics whose column headers are not preserved in the transcript.
Group | Original Data: N | Original Data: metrics | Sanitized Data: N | Sanitized Data: metrics
All | 69,301,337 | 6.5, 11.4, 21.6 | 69,299,074 | 6.5, 11.4, 21.6
Gender (self-reported)
Female | 30,545,765 | 6.9, 11.5, 21.1 | 30,545,765 | 6.9, 11.5, 21.1
Male | 38,624,554 | 6.3, 11.3, 21.8 | 38,624,554 | 6.3, 11.3, 21.8
… | … | … | … | …
Language preference
Chinese | 1,564,364 | 6.5, 11.1, 22.0 | 1,571,348 | 6.5, 11.1, 21.8
… | … | … | … | …
Yahoo! frequency data is now available online at: https://figshare.com/articles/Yahoo_Password_Frequency_Corpus/2057937

41. Yahoo! Results
Same table as on slide 40, with sources added: Original Data [B12], Sanitized Data [BDB16].

42. Yahoo! Results (Selecting Epsilon)
Same table as on slide 40.
Any individual participates in at most 23 groups (including All).
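
The remark that an individual participates in at most 23 groups matters because privacy loss adds up across releases. The following is a worked bound under basic composition; the symbols $\epsilon_g$ and $\delta_g$ for the per-group parameters are notation introduced here, and the actual values chosen for the Yahoo! release are not preserved in this transcript.

```latex
% Basic composition: a user u who appears in the released groups g \ni u
% (at most 23 of them, including "All") incurs total privacy loss
\epsilon_{\mathrm{total}} \;\le\; \sum_{g \ni u} \epsilon_g
  \;\le\; \epsilon_{\mathrm{All}} + 22 \max_{g \ne \mathrm{All}} \epsilon_g ,
\qquad
\delta_{\mathrm{total}} \;\le\; \sum_{g \ni u} \delta_g .
```

Under this accounting, keeping the overall guarantee at a target level means the per-group parameter has to shrink as the number of overlapping groups grows.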

46. New Result #1 (Mean Squared Error)
The exponential mechanism achieves good MSE: [bound not preserved].
Improves on [HMJ09] by a factor [not preserved] if [condition not preserved].
Improves on [HMJ09] by a factor [not preserved].

47. Proof Sketch
[Definitions, claim, and lemma not preserved in the transcript.]

48. New Result #2 (pure differential privacy)
New algorithm samples from the exponential mechanism in polynomial time.
Pro: better group privacy guarantees.
Con: not efficient enough for practice when n is large.

49. New Result #3 (Lower Bound on L1 Error)
Theorem: For any differentially private mechanism there exists a partition such that [bound not preserved].

50. An Open Problem
Application to social networks: degree distribution with node privacy.
Conjecture: [statement not preserved]

51. Lower Bounds on L1 Error [AS16, B16]
Relevant when [condition not preserved].

52. Empirical Evidence (100 Samples) n=32.6 million users

53. More Empirical Evidence 

54. More Empirical Evidence  

55. Comparison with Prior Techniques

56. Conclusions
Differential privacy enables analysis of sensitive data.
The exponential mechanism is not always intractable: integer partitions are one example.
Other practical settings?
Applications to social networks?

57. Thanks for Listening
Anupam Datta (CMU)
Joseph Bonneau (NYU)

58. Dinner Discussion Topics
Risk: industry deployments become the de facto standard for selecting $\epsilon$.

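60. Integer Partition
Definition: A partition of an integer n > 0 is a non-increasing sequence of numbers $x_1 \ge x_2 \ge \dots \ge x_k > 0$ such that $\sum_{i=1}^{k} x_i = n$.
Example: (2, 1, 1) is a partition of 4, as in the frequency list on slide 3.
Fact [HR18]: The number of partitions of n is asymptotically $\frac{1}{4n\sqrt{3}} e^{\pi\sqrt{2n/3}}$.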
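
To give a feel for how fast the number of partitions grows, the sketch below counts partitions exactly with a standard dynamic program and compares the count with the Hardy-Ramanujan asymptotic formula. This counting routine is a textbook illustration and is not the sampling algorithm from the talk; the function names are chosen here.

```python
import math

def count_partitions(n):
    """Exact number of integer partitions of n (classic coin-change style DP)."""
    ways = [1] + [0] * n  # ways[t] = partitions of t using the parts seen so far
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]

def hardy_ramanujan(n):
    """Asymptotic estimate exp(pi * sqrt(2n/3)) / (4 * n * sqrt(3))."""
    return math.exp(math.pi * math.sqrt(2 * n / 3)) / (4 * n * math.sqrt(3))

print(count_partitions(4))    # 5
print(count_partitions(100))  # 190569292
print(hardy_ramanujan(100) / count_partitions(100))  # close to 1
```

Even at n = 100 there are roughly 190 million partitions, which is why enumerating candidate outputs is hopeless for N in the tens of millions.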

61. Integer Partition: Basic Facts
Fact [Fristedt93]: [statement not preserved]

62. Outline
Password Frequency List
Potential Security Concerns
Differential Privacy
A DP Algorithm with Minimal Distortion
Released Yahoo! Frequency List

63. Password Frequency List (Application 2)
Quantify the benefits from key-stretching.
Halting condition (rational offline adversary): Marginal Guessing Cost = Marginal Benefit.
Password frequency lists allow us to estimate the Marginal Guessing Cost (MGC) and the Marginal Benefit (MB), so we can estimate when the offline adversary will give up.
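
One way to turn the halting condition on slide 63 into a computation is sketched below. The cost model is an assumption made for this example and is not taken from the talk: each cracked account is worth `value`, each guess costs `cost_per_guess` (key-stretching raises this), the attacker tries passwords in decreasing order of frequency, and a guess is only paid for on accounts not yet cracked.

```python
def guesses_before_quitting(freq_list, value, cost_per_guess):
    """Number of guesses per account a rational attacker makes under the toy model."""
    n = sum(freq_list)
    cracked = 0.0  # fraction of accounts cracked so far
    for i, count in enumerate(sorted(freq_list, reverse=True)):
        p = count / n
        marginal_benefit = value * p
        marginal_cost = cost_per_guess * (1.0 - cracked)
        if marginal_benefit < marginal_cost:
            return i  # the next guess costs more than it is expected to earn
        cracked += p
    return len(freq_list)

toy = [6, 1, 1, 1, 1]
print(guesses_before_quitting(toy, value=4, cost_per_guess=2))  # 1
print(guesses_before_quitting(toy, value=4, cost_per_guess=3))  # 0
```

In this toy model, raising the per-guess cost from 2 to 3 makes the attacker give up before the first guess, which is the kind of benefit from key-stretching that a frequency list lets one quantify.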
