James Davis, Christy Coghlan, Francisco Servant, Dongyoon Lee

James Davis Christy Coghlan Francisco Servant Dongyoon Lee - PowerPoint Presentation

Uploaded On 2019-10-31


James Davis, Christy Coghlan, Francisco Servant, Dongyoon Lee. The Impact of Regular Expression Denial of Service (ReDoS) in Practice. Distinguished paper. ReDoS regexes are prevalent in the wild.


Presentation Transcript

The Impact of Regular Expression Denial of Service (ReDoS) in Practice
James Davis, Christy Coghlan, Francisco Servant, Dongyoon Lee
Distinguished paper

Contributions
  ReDoS regexes are prevalent in the wild (Thousands! Not just one or two)
  Heuristics to detect them are inaccurate
  Developers (try to) fix them with 3 techniques

Background

Regexes
What are they?
  Describe a “language”
  Used for
    Input validation
    Text manipulation
  (Poorly tested) [W&S ’18]: Thursday!
Examples
  /^a+$/ → Some ‘a’
  /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/ → IPv4
  /^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$/ → Username (ReDoS regex)
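As a small illustration of the input-validation use case, the three example regexes from this slide can be tried in Python's `re` module. The names `SOME_A`, `IPV4`, `USERNAME`, and `validate` are made up for this sketch and are not from the talk.

```python
import re

# The three example regexes from the slide, written as Python raw strings.
SOME_A = re.compile(r"^a+$")
IPV4 = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
USERNAME = re.compile(r"^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$")


def validate(pattern, text):
    """Return True if the pattern matches somewhere in the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    print(validate(SOME_A, "aaa"))          # some 'a'
    print(validate(IPV4, "192.168.0.1"))    # looks like an IPv4 address
    print(validate(USERNAME, "susie.q"))    # plausible username
```

Note that the IPv4 pattern is unanchored and only checks digit counts, so it also accepts strings such as `999.999.999.999`.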

Regex Engines
What are they?
  match(regex, string) → “Does this regex match this string?”
How do they work?
  /^a+$/
  Simulate on input “aaa”
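To make "simulate on input" concrete, here is a minimal sketch of simulating /^a+$/ on a string. It assumes a hand-written two-state automaton (state 0 is the start, state 1 means "at least one a seen") and tracks the set of reachable states; real engines differ, and the function name is hypothetical.

```python
def matches_a_plus(text):
    """Simulate a two-state automaton for /^a+$/ on the whole input."""
    # transitions[(state, char)] -> set of next states
    transitions = {(0, "a"): {1}, (1, "a"): {1}}
    accepting = {1}

    current = {0}  # set of states the automaton could be in
    for ch in text:
        current = set().union(
            *(transitions.get((state, ch), set()) for state in current)
        )
        if not current:  # no state survives: mismatch
            return False
    return bool(current & accepting)


if __name__ == "__main__":
    print(matches_a_plus("aaa"))
    print(matches_a_plus("aab"))
```

Tracking a set of states visits each input character once. Backtracking engines instead follow one path at a time, which is what makes the redundant exploration in later slides possible.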

ReDoS Regexes
Simple ReDoS regex: /^(a+)+$/
[Figure: NFA with exponential paths, ending in a mismatch]
Malicious input: “aaaaaaaaaa…aa!”
Recurrence relation
  T(n) = 2*T(n-1) = 2*(2*T(n-2)) = … = O(2^n)
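The recurrence can be checked with a toy model of a backtracking engine. The sketch below is an assumption-laden simplification, not a real regex engine: it only counts the positions at which /^(a+)+$/ could start another iteration of the outer group before failing on the trailing "!". The function name `count_paths` is hypothetical.

```python
def count_paths(text, start=0):
    """Count the backtracking states visited for /^(a+)+$/ from `start`.

    Each call is one state: the outer group may end here (and then fail
    on '$' for a malicious input) or the inner a+ may consume 1, 2, ...
    more 'a' characters and recurse.
    """
    paths = 1  # the branch that stops at `start`
    end = start
    while end < len(text) and text[end] == "a":
        end += 1
        paths += count_paths(text, end)  # inner a+ consumed text[start:end]
    return paths


if __name__ == "__main__":
    for n in (4, 8, 12):
        print(n, count_paths("a" * n + "!"))
```

In this model the count doubles with every extra "a", which is the T(n) = 2*T(n-1) behaviour stated on the slide.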

Regular Expression Denial of Service (ReDoS)
[C&W ’03] [DWL ’18] [S&P ’18]
Regex: /^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$/
Benign input: “Susie”
Malicious input injected: “aaa…aaaa!”
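A hedged way to observe the effect is to time the username regex in CPython's backtracking `re` module on the benign input and on short malicious inputs. The helper `timed_match` is hypothetical, and the input lengths are kept deliberately small because the match time grows very quickly; absolute timings depend on the machine.

```python
import re
import time

USERNAME = re.compile(r"^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$")


def timed_match(text):
    """Return (matched, seconds) for one match attempt."""
    start = time.perf_counter()
    matched = USERNAME.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    print("Susie", timed_match("Susie"))
    # Keep n small: each extra 'a' roughly doubles the work.
    for n in (8, 12, 16):
        print(n, timed_match("a" * n + "!"))
```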

Research

Theme 1: ReDoS Regexes in the Wild

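Collecting Regexes
[Figure: pipeline from module selection, through the “Can clone” filter and module regex extraction, to a giant list of regexes]
  npm: 565K modules, 375K (66%) after the filter, 350K regexes
  pypi: 125K modules, 70K (58%) after the filter, 60K regexes
  Module regex extraction: 45%, 35%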

Analyzing Regexes
Detectors: [R&T ’14] [WMBW ’16] [WOHD ’17]
1. ReDoS regexes
  npm: 3.6K (1%) of 350K regexes, affecting 13K (3%) of modules
  pypi: 704 (1%) of 60K regexes, affecting 705 (1%) of modules
2. Degree

ReDoS Regexes are Usually Quadratic

ReDoS Regexes occur in Prominent Places
[Figure: five prominent places, labeled 2 regexes, 3 regexes, 3 regexes, 1 regex, 1 regex]

Theme 2: Do Developers’ Heuristics Work?

Heuristics Used in Practice
Finding heuristics
  Reference books
  Regex websites
What they said
  Star height – safe-regex
  “Watch out when different parts of the regex can match the same text”
Our heuristics
  Exponential: Star height, Q.O.D.
  Polynomial: Q.O.A.
  [Example regexes for each heuristic; quantifiers lost in extraction: ( ) ( a|a ) a a]
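The star height heuristic can be sketched as a scan that tracks how deeply unbounded quantifiers (`*`, `+`, `{m,}`) are nested. This is a simplified illustration written for this transcript, not the implementation used by safe-regex or by the authors' detectors; it assumes a syntactically valid pattern and ignores many regex features. The function name `star_height` is hypothetical.

```python
import re

_UNBOUNDED_BRACE = re.compile(r"\{\d*,\}")  # {m,}: no upper bound


def star_height(pattern):
    """Return the maximum nesting depth of unbounded quantifiers."""
    stack = [0]  # max height seen so far inside each open group
    last = 0     # height of the most recently completed atom or group
    best = 0
    i = 0
    while i < len(pattern):
        c = pattern[i]
        brace = _UNBOUNDED_BRACE.match(pattern, i) if c == "{" else None
        if c == "\\":                      # escaped character: plain atom
            i += 2
            last = 0
        elif c == "[":                     # skip a whole character class
            i += 1
            if i < len(pattern) and pattern[i] == "^":
                i += 1
            if i < len(pattern) and pattern[i] == "]":
                i += 1
            while i < len(pattern) and pattern[i] != "]":
                i += 2 if pattern[i] == "\\" else 1
            i += 1
            last = 0
        elif c == "(":
            stack.append(0)
            i += 1
        elif c == ")" and len(stack) > 1:  # group closes: inherit its height
            last = stack.pop()
            stack[-1] = max(stack[-1], last)
            i += 1
        elif c in "*+" or brace:           # unbounded quantifier
            last += 1
            stack[-1] = max(stack[-1], last)
            best = max(best, last)
            i = brace.end() if brace else i + 1
            if i < len(pattern) and pattern[i] == "?":  # lazy marker
                i += 1
        else:                              # any other character: plain atom
            last = 0
            i += 1
    return best
```

Under this sketch `star_height("^(a+)+$")` is 2 and would be flagged by a "star height > 1" rule, while `(a|a)+` has star height 1 and is only caught by the disjunction heuristic.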

Developed detectors for these heuristics
  Imprecise – modeled on safe-regex (star height)
  Few false negatives
  Many false positives

Bringing Research to Practice
  I am now the maintainer of safe-regex
  Fixed false negatives in star height heuristic
  Incorporating (improved) QOD and QOA heuristics

Theme 3: The Repair of ReDoS Regexes

(ReDoS) Regexes Are Hard to Understand [CWS ’17]
  /^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/
  /([^\=\*\s]+)(\*)?\s*\=\s*(?:([^;'"\s]+\'[\w]*\'[^;\s]+)|(?:\"([^"]*)\")|([^;\s]*))(?:\s*(?:;\s*)|$)/
  /^\S+@\S+\.\S+$/
  /^(.*?)([.,:;!]+)$/
  /<(\/)?([^ ]+?)(?:(\s*\/)| .*?)?>/
  /\+OK.*(<[^>]+>)/
  /\s*#?\s*$/
  /^\s*/\*\s*(.+?)\s*\*/\s*$/
  /^([\\s\\S]*?)((?:\\.{1,2}|[^\\\\\\/]+?|)(\\.[^.\\/\\\\]*|))(?:[\\\\\\/]*)$/
  /^(\\/?|)([\\s\\S]*?)((?:\\.{1,2}|[^\\/]+?|(\\.[^.\\/]*|))(?:[\\/]*)$/

Methodology
Historic
  Study all ReDoS reports in CVE and Snyk.io DBs
  Disclosures & fixes
“What do developers prefer when they know all the fix strategies?”
  Email 284 module maintainers
  Vulnerability disclosure
  Fix strategies

Fix Strategies For ReDoS Regexes
Original: /^\S+@\S+\.\S+$/
Fix strategies
  Trim: TRUNCATE(input, 1000)
  Revise: /^[^@]+@([^\.@]+\.)+$/
  Replace* (custom parser)
    Exactly one @, somewhere in the middle of the string
    A ‘.’ to the right of the @
    But not immediately to the right
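The Trim and Replace strategies can be sketched in Python as follows. This is one reading of the slide's bullet points, not the fix any particular maintainer shipped; the function names and the `MAX_LEN` constant name are made up, and the 1000-character limit is the value shown on the slide.

```python
import re

ORIGINAL = re.compile(r"^\S+@\S+\.\S+$")
MAX_LEN = 1000  # limit from the slide's TRUNCATE(input, 1000)


def is_email_trim(text):
    """Trim: bound the input length before applying the original regex."""
    return ORIGINAL.match(text[:MAX_LEN]) is not None


def is_email_replace(text):
    """Replace: a custom parser following the three rules on the slide."""
    if text.count("@") != 1:               # exactly one '@'
        return False
    at = text.index("@")
    if at == 0 or at == len(text) - 1:     # somewhere in the middle
        return False
    # a '.' to the right of the '@', but not immediately to the right
    return text.find(".", at + 2) != -1


if __name__ == "__main__":
    print(is_email_trim("user@example.com"))
    print(is_email_replace("user@example.com"))
    print(is_email_replace("user@.com"))
```

Both strategies change which strings are accepted: truncation judges only the first 1000 characters, and the parser rejects inputs with more than one "@" that the original regex would accept. That kind of behaviour change is one reason a fix can end up incorrect.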

Fix strategies and correctness
[Figure: fix strategies by correctness; labels: 2 incorrect, 1 incorrect, “All correct”]

Closing Remarks

(Some) Limitations and Future Work
Reachability
  Module vs. application?
  Why do developers use regexes? [C&S ’16]
ReDoS regex == ReDoS?
  Scaling up [S&P ’18]?
  Study how modules are used?

ReDoS regexes are a real problem in practice
  Regexes are widely used in JavaScript and Python modules
  1% of unique regexes are ReDoS regexes
  ReDoS regexes occur in 1-3% of modules
Heuristics are inaccurate
ReDoS regexes are hard to fix
Thank you for your attention

Bonus material

We study the sub-class of regexes with super-linear structure
  Structure permits redundant state exploration via backtracking
  Graph reachability problem
We ignore super-linear regex features
  Backreferences, lookaround assertions
Subset of all possible SL regexes

False negatives in the SL regex detectors
The SL regex detectors are using fullMatch semantics
Regex engines sometimes implement partialMatch… unwisely
Thus, and somewhat horrifyingly:
  /(a+)b/ may be super-linear
  Equivalent to the fullMatch /^.*?(a+)b/ → QOA
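A toy model shows why an unanchored search for /(a+)b/ can be quadratic. The sketch assumes a naive engine that retries the match at every start offset and backtracks the greedy `a+` one character at a time; engines with extra optimizations may behave better. The function name `search_steps` is hypothetical.

```python
def search_steps(text):
    """Return (steps, found) for a naive partial-match search of /(a+)b/."""
    steps = 0
    for start in range(len(text)):           # partialMatch: try every offset
        end = start
        while end < len(text) and text[end] == "a":
            end += 1                          # greedy a+ consumes an 'a'
            steps += 1
        for stop in range(end, start, -1):    # give back one 'a' at a time
            steps += 1
            if stop < len(text) and text[stop] == "b":
                return steps, True
    return steps, False


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, search_steps("a" * n))
```

In this model an input of n "a" characters with no "b" costs n(n+1) steps, so doubling the input length roughly quadruples the work.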

Regex usage in practice

Heuristics
Heuristic            Example   Complexity                     Why?
Star height > 1      (a+)+     Exponential                    2 choices; each path explores the same states
Q.O. Disjunction*    (a|a)+    Exponential                    ''
Q.O. Adjacency*      .+.+      Polynomial (earlier example)   2 choices; 1 path explores …

Domains HTML URL Numeric Emails User-agent strings

Regex Domains
Semantic meaning      # in npm   # in pypi
Error message         22 K       881
File name             10 K       497
HTML                  8 K        2.5 K
URL                   7 K        2 K
CamelCase etc.        4 K        1 K
Source code           4 K        105
User-agent strings    3 K        124
Whitespace            2 K        441
Number                762        238
Email                 444        97
Label rate            18%        13%

ReDoS Regexes occur in Different Domains
[Figure: axis in %]

Fix strategies
                          Total   Trim   Revise   Replace
Historic   Fix approach   37      8      18       11
           # Unsafe       3       1      2        0
New        Fix approach   4833515
           # Unsafe       0       0      0        0
Fixing ReDoS regexes is hard
Revising is popular

About That Microsoft Regex…
We disclosed ~50 ReDoS regexes to Microsoft
After several months…
  Listed me in their July "Security Researcher Acknowledgments"
  Would not tell me what changes resulted from my report

More Limitations and Future Work
  ReDoS regex detectors → More feature support [SJXYML ’18]
  Reachability → Why do devs use regexes? [C&S ’16]
  ReDoS regex == ReDoS? → Scaling up [S&P ’18]?
  Generalizability → Trends across languages / apps?

Why not applications?
  Open-source applications may not be representative of code in industry
  Module ecosystems are shared by open-source and industry
  Modules are sometimes authored by industry as a way to give back to the open-source community

Why git projects instead of artifacts?
  Bundling / obfuscation complicates analysis on registry artifacts

ReDoS Regexes by Size and Popularity
[Figure: axes are lines of code (log) and downloads (log)]
>1000 downloads/month – Especially concerning

ReDoS Regexes by Size and Popularity
[Figure: two panels, npm and pypi; axes are lines of code (log scale) and downloads (log scale)]
  “Trivial packages” – Abdalkareem et al., FSE’17
  Huge but few downloads – e.g. open-sourced company frameworks
  >1000 downloads/month – ReDoS threat

Developed anti-pattern detectors
Imprecise – modeled on safe-regex (star height > 1)
Results (npm)
Anti-pattern        # SL regexes   False positive rate
Star height > 1     450 (12%)      94%
Q.O. Disjunction    40 (1%)        97%
Q.O. Adjacency      2.5K (71%)     91%
Few false negatives (good recall)
Many false positives (poor precision)

How super-linear are the SL regexes?
Degree of vuln.        npm          pypi
Exponential            245 (7%)     41 (6%)
Poly: n^2              2638 (74%)   534 (76%)
Poly: n^3              535 (15%)    107 (15%)
Poly: n^4              44 (1%)      5 (1%)
Poly: beyond (exp?)    100 (3%)     15 (2%)
Evaluated SL regexes on a range of input lengths
Fit curves for different polynomials and exponentials
Chose the curve of best fit
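The curve-fitting step could look roughly like the sketch below. It is an assumption about one reasonable way to do it, not the authors' procedure: it fits a power law (a straight line in log-log space) and an exponential (a straight line in semi-log space) by least squares and keeps whichever leaves the smaller residual. The names `_fit` and `classify_growth`, and the synthetic cost data, are made up for illustration.

```python
import math


def _fit(xs, ys):
    """Least-squares line y = a + b*x; return (slope, sum of squared residuals)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, sse


def classify_growth(lengths, costs):
    """Classify measured costs as ('polynomial', degree) or ('exponential', base)."""
    log_costs = [math.log(c) for c in costs]
    poly_slope, poly_err = _fit([math.log(n) for n in lengths], log_costs)
    exp_slope, exp_err = _fit(list(lengths), log_costs)
    if poly_err <= exp_err:
        return "polynomial", round(poly_slope)
    return "exponential", round(math.exp(exp_slope), 2)


if __name__ == "__main__":
    quad_lengths = [10, 20, 40, 80, 160]
    print(classify_growth(quad_lengths, [n ** 2 for n in quad_lengths]))
    exp_lengths = [5, 10, 15, 20, 25]
    print(classify_growth(exp_lengths, [2 ** n for n in exp_lengths]))
```

With real measurements the costs would be match times or step counts and would be noisy, so the choice between the two models is less clear-cut than with the synthetic data used here.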

ReDoS Regexes in npm and pypi
Registry   Total modules   Scanned modules   Unique regexes   SL regexes   Affected modules
npm        565K            375K (66%)        350K             3.6K (1%)    13K (3%)
pypi       125K            70K (58%)         60K              700 (1%)     700 (1%)
More copy/paste in JS? e.g. 2K are regexes taken from Node.js core