James Davis, Christy Coghlan, Francisco Servant, Dongyoon Lee
The Impact of Regular Expression Denial of Service (ReDoS) in Practice
Distinguished paper
Contributions
  ReDoS regexes are prevalent in the wild (thousands, not just one or two)
  Heuristics to detect them are inaccurate
  Developers (try to) fix them with 3 techniques
Background
Regexes
  What are they?
    Describe a “language”
    Used for: input validation, text manipulation
    (Poorly tested) [W&S ’18]: Thursday!
  Examples
    /^a+$/   Some ‘a’
    /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/   IPv4
    /^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$/   Username (ReDoS regex)
Regex Engines
  What are they?
    match(regex, string): “Does this regex match this string?”
  How do they work?
    /^a+$/: simulate on input “aaa”
ReDoS Regexes
  Simple ReDoS regex: /^(a+)+$/
  NFA
  Malicious input: “aaaaaaaaaa…aa!”
  Exponential paths, then mismatch
  Recurrence relation: T(n) = 2*T(n-1) = 2*(2*T(n-2)) = O(2^n)
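The recurrence on this slide can be checked with a toy backtracking matcher. The sketch below is a hand-written simulation of how a backtracking engine might explore /^(a+)+$/, not the code of any real regex engine. The function name count_backtracking_steps and the counting convention (one step per attempt at the outer group) are assumptions made for illustration.

```python
def count_backtracking_steps(text: str) -> tuple[bool, int]:
    """Simulate a backtracking match of /^(a+)+$/ and count outer-group attempts."""
    steps = 0

    def match_outer(pos: int) -> bool:
        # Try to match one more iteration of (a+), starting at pos.
        nonlocal steps
        steps += 1
        # Greedy a+: find how far the run of 'a' extends.
        run_end = pos
        while run_end < len(text) and text[run_end] == "a":
            run_end += 1
        # Backtrack over every possible length of this a+, longest first.
        for end in range(run_end, pos, -1):
            if end == len(text):
                return True  # reached '$'
            if match_outer(end):
                return True
        return False

    matched = match_outer(0)
    return matched, steps


if __name__ == "__main__":
    for n in (5, 10, 15):
        print(n, count_backtracking_steps("a" * n + "!"))
```

Under this counting convention, a mismatching input of n 'a' characters followed by '!' costs 2^n attempts, so each extra 'a' doubles the work, as T(n) = 2*T(n-1) predicts. A matching input such as "aaa" succeeds on the first attempt.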
Regular Expression Denial of Service (ReDoS)
  Regex: /^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$/
  Input: “Susie”
  Malicious input injected: “aaa…aaaa!”
  [C&W ‘03] [DWL ‘18] [S&P ‘18]
Research
Theme 1: ReDoS Regexes in the Wild
Collecting Regexes
  Pipeline: module selection, “can clone” filter, module regex extraction, giant list of regexes
                            npm          pypi
  Total modules             565K         125K
  Scanned modules           375K (66%)   70K (58%)
  Module regex extraction   45%          35%
  Unique regexes            350K         60K
Analyzing Regexes
  [R&T ‘14] [WMBW ‘16] [WOHD ‘17]
  1. ReDoS regexes
  2. Degree
                      npm          pypi
  Unique regexes      350K         60K
  ReDoS regexes       3.6K (1%)    704 (1%)
  Affected modules    13K (3%)     705 (1%)
ReDoS Regexes are Usually Quadratic
ReDoS Regexes Occur in Prominent Places
  Counts per project (project names not preserved in the text): 2 regexes, 3 regexes, 3 regexes, 1 regex, 1 regex
Theme 2: Do Developers’ Heuristics Work?
Heuristics Used in Practice
  Finding heuristics
    Reference books
    Regex websites
  What they said
    Star height (safe-regex)
    “Watch out when different parts of the regex can match the same text”
  Our heuristics
    Star height: exponential
    Q.O.D.: exponential
    Q.O.A.: polynomial
    Examples on the slide (quantifiers lost in extraction): ( ), (a|a), a a
Developed detectors for these heuristics
  Imprecise: modeled on safe-regex (star height)
  Few false negatives
  Many false positives
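To make the star height heuristic concrete, here is a minimal sketch of a star height calculator. It is not the safe-regex implementation or the authors' detector. The function name star_height is hypothetical, and the parser assumes a well-formed pattern in a simplified syntax: escapes, character classes, groups, and the quantifiers *, +, ? and {m,n}.

```python
def star_height(pattern: str) -> int:
    """Return the maximum nesting depth of unbounded quantifiers (*, +, {m,})."""

    def skip_class(i: int) -> int:
        # pattern[i] == '['; return the index just past the closing ']'.
        i += 1
        while i < len(pattern) and pattern[i] != "]":
            i += 2 if pattern[i] == "\\" else 1
        return i + 1

    def parse(i: int) -> tuple[int, int]:
        # Parse until ')' or end of pattern; return (height, stop index).
        best = 0
        while i < len(pattern) and pattern[i] != ")":
            ch = pattern[i]
            height = 0
            if ch == "\\":
                i += 2                      # escaped character
            elif ch == "[":
                i = skip_class(i)           # character class
            elif ch == "(":
                height, i = parse(i + 1)    # group body
                i += 1                      # consume ')'
            else:
                i += 1                      # literal, '.', '|', '?', '^', '$', ...
            # Optional unbounded or bounded quantifier directly after the atom.
            if i < len(pattern) and pattern[i] in "*+":
                height += 1
                i += 1
            elif i < len(pattern) and pattern[i] == "{":
                close = pattern.find("}", i)
                body = pattern[i + 1:close] if close != -1 else ""
                if body and all(c.isdigit() or c == "," for c in body):
                    if body.endswith(","):  # {m,} has no upper bound
                        height += 1
                    i = close + 1
            best = max(best, height)
        return best, i

    return parse(0)[0]


if __name__ == "__main__":
    for regex in (r"^a+$", r"^(a+)+$", r"^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$"):
        print(regex, star_height(regex))
```

A detector built on this would flag any regex whose star height exceeds 1. Because it looks only at nesting and not at whether the nested parts can match the same text, a check like this is expected to produce false positives, in line with the imprecision noted on the slide.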
Bringing Research to Practice
  I am now the maintainer of safe-regex
  Fixed false negatives in star height heuristic
  Incorporating (improved) QOD and QOA heuristics
Theme 3: The Repair of ReDoS Regexes
(ReDoS) Regexes Are Hard to Understand
  /^(\+|-)?(\d+|(\d*\.\d*))? ( E|e)?([-+])?(\d+)?$/
  /([^\=\*\s]+)(\*)?\s*\=\s* (?:([^;'"\s]+\'[\w]*\’ [^;\s]+)|(?:\"([^"]*)\")| ([^;\s]*))(?:\s*(?:;\s*)|$)/
  /^\S+@\S+\.\S+$/
  /^(.*?)([.,:;!]+)$/
  /<(\/)?([^ ]+?)(?:(\s*\/)| .*?)?>/
  /\+OK.*(<[^>]+>)/
  /\s*#?\s*$/
  /^\s*/\*\s*(.+?)\s*\*/\s*$/
  /^([\\s\\S]*?)((?:\\.{1,2}|[^\\\\\\/]+?|)(\\.[^.\\/\\\\]*|))(?:[\\\\\\/]*)$/
  /^(\\/?|)([\\s\\S]*?)((?:\\.{1,2}|[^\\/]+?|(\\.[^.\\/]*|))(?:[\\/]*)$/
  [CWS ‘17]
Methodology
  Historic
    Study all ReDoS reports in CVE and Snyk.io DBs
  Disclosures & Fixes
    “What do developers prefer when they know all the fix strategies?”
    Email 284 module maintainers
      Vulnerability disclosure
      Fix strategies
Fix Strategies for ReDoS Regexes
  Original: /^\S+@\S+\.\S+$/
  Fix strategies
    Trim: TRUNCATE(input, 1000)
    Revise: /^[^@]+@([^\.@]+\.)+$/
    Replace* (custom parser)
      Exactly one @, somewhere in the middle of the string
      A ‘.’ to the right of the @
      But not immediately to the right
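The Trim and Replace strategies on this slide could look roughly like the following. This is a sketch under stated assumptions, not the code any maintainer shipped: the function names validate_trim and validate_replace are hypothetical, the 1000-character limit is taken from the slide, and the custom parser checks only the three rules listed there. Revise only swaps in a different pattern, so it is not repeated here.

```python
import re

ORIGINAL = re.compile(r"^\S+@\S+\.\S+$")   # original regex from the slide
MAX_LEN = 1000                              # limit from TRUNCATE(input, 1000)


def validate_trim(text: str) -> bool:
    """Trim: bound the input length before running the original regex."""
    return ORIGINAL.match(text[:MAX_LEN]) is not None


def validate_replace(text: str) -> bool:
    """Replace: a linear-time custom parser for the three rules on the slide."""
    if text.count("@") != 1:
        return False                        # exactly one '@'
    at = text.index("@")
    if at == 0 or at == len(text) - 1:
        return False                        # '@' somewhere in the middle
    # A '.' to the right of the '@', but not immediately to the right.
    return text.find(".", at + 2) != -1


if __name__ == "__main__":
    for candidate in ("susie@example.com", "a@.c", "a@@b.c"):
        print(candidate, validate_trim(candidate), validate_replace(candidate))
```

Neither variant is guaranteed to accept exactly the same strings as the original regex. Trimming changes the verdict for inputs longer than the limit, and the parser rejects a second '@' that the original pattern would allow. That kind of semantic drift is one reason a fix needs checking for correctness.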
Fix strategies and correctness
  Chart annotations: 2 incorrect; 1 incorrect; “All correct”
Closing Remarks
(Some) Limitations and Future Work
  Reachability
  Module vs. application?
  Why do developers use regexes? [C&S ‘16]
  ReDoS regex == ReDoS? Scaling up [S&P ‘18]?
  Study how modules are used?
ReDoS regexes are a real problem in practice
  Regexes are widely used in JavaScript and Python modules
  1% of unique regexes are ReDoS regexes
  ReDoS regexes occur in 1-3% of modules
  Heuristics are inaccurate
  ReDoS regexes are hard to fix
Thank you for your attention
Bonus material
Subset of all possible SL regexes
  We study the sub-class of regexes with super-linear structure
    Structure permits redundant state exploration via backtracking
    Graph reachability problem
  We ignore super-linear regex features
    Backreferences, lookaround assertions
False negatives in the SL regex detectors
  The SL regex detectors are using fullMatch semantics
  Regex engines sometimes implement partialMatch… unwisely
  Thus, and somewhat horrifyingly:
    /(a+)b/ may be super-linear
    Equivalent to the fullMatch /^.*?(a+)b/ (QOA)
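The partial-match effect can be illustrated by counting the work a naive unanchored search does for /(a+)b/. The sketch below is a hand-written simulation, not a real engine. The function name search_steps and the counting convention (one step per 'a' consumed and one per check for 'b') are assumptions for illustration.

```python
def search_steps(text: str) -> tuple[bool, int]:
    """Simulate an unanchored backtracking search for /(a+)b/ and count steps."""
    steps = 0
    for start in range(len(text)):              # partial match: try every start offset
        end = start
        while end < len(text) and text[end] == "a":
            end += 1                            # greedy a+ consumes one more 'a'
            steps += 1
        for stop in range(end, start, -1):      # backtrack a+ one character at a time
            steps += 1                          # check whether 'b' follows the a's
            if stop < len(text) and text[stop] == "b":
                return True, steps
    return False, steps


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, search_steps("a" * n))
```

On an input of n 'a' characters with no 'b', this convention gives n*(n+1) steps, so doubling the input roughly quadruples the work. The retry at every start offset plays the role of the leading .*? in the equivalent fullMatch form.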
Regex usage in practice
Heuristics
  Heuristic            Example   Complexity                      Why?
  Star height > 1      (a+)+     Exponential                     2 choices; each path explores the same states
  Q.O. Disjunction*    (a|a)+    Exponential                     (same as above)
  Q.O. Adjacency*      .+.+      Polynomial (earlier example)    2 choices; 1 path explores …
Domains: HTML, URL, Numeric, Emails, User-agent strings
Regex Domains
  Semantic meaning      # in npm   # in pypi
  Error message         22K        881
  File name             10K        497
  HTML                  8K         2.5K
  URL                   7K         2K
  CamelCase etc.        4K         1K
  Source code           4K         105
  User-agent strings    3K         124
  Whitespace            2K         441
  Number                762        238
  Email                 444        97
  Label rate            18%        13%
ReDoS Regexes Occur in Different Domains (chart, values in %)
Fixing ReDoS Regexes is hard
  Fix strategies      Total   Trim   Revise   Replace
  Historic
    Fix approach      37      8      18       11
    # Unsafe          3       1      2        0
  New
    Fix approach      48      3      35       15
    # Unsafe          0       0      0        0
  Revising is popular
About That Microsoft Regex…
  We disclosed ~50 ReDoS regexes to Microsoft
  After several months…
    Listed me in their July "Security Researcher Acknowledgments"
    Would not tell me what changes resulted from my report
More Limitations and Future Work
  ReDoS regex detectors
    More feature support [SJXYML ‘18]
  Reachability
    Why do developers use regexes? [C&S ‘16]
    ReDoS regex == ReDoS? Scaling up [S&P ‘18]?
  Generalizability
    Trends across languages / applications?
Why not applications?
  Open-source applications may not be representative of code in industry
  Module ecosystems are shared by open-source and industry
  Modules are sometimes authored by industry as a way to give back to the open-source community
Why git projects instead of artifacts?
  Bundling / obfuscation complicates analysis on registry artifacts
ReDoS Regexes by Size and Popularity
  Plot axes: lines of code (log), downloads (log)
  >1000 downloads/month: especially concerning
ReDoS Regexes by Size and Popularity
  Panels: npm, pypi; axes: lines of code (log scale), downloads (log scale)
  “Trivial packages”: Abdalkareem et al., FSE’17
  Huge but few downloads: e.g. open-sourced company frameworks
  >1000 downloads/month: ReDoS threat
Developed anti-pattern detectors
  Imprecise: modeled on safe-regex (star height > 1)
  Results (npm)
    Anti-pattern        # SL regexes   False positive rate
    Star height > 1     450 (12%)      94%
    Q.O. Disjunction    40 (1%)        97%
    Q.O. Adjacency      2.5K (71%)     91%
  Few false negatives: good recall
  Many false positives: poor precision
How super-linear are the SL regexes?
  Degree of vuln.        npm          pypi
  Exponential            245 (7%)     41 (6%)
  Poly: n^2              2638 (74%)   534 (76%)
  Poly: n^3              535 (15%)    107 (15%)
  Poly: n^4              44 (1%)      5 (1%)
  Poly: beyond (exp?)    100 (3%)     15 (2%)
  Evaluated SL regexes on a range of input lengths
  Fit curves for different polynomials and exponentials
  Chose the curve of best fit
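The slide does not say how the curves were fitted. One plausible way to choose between a polynomial and an exponential, shown here only as a sketch, is least squares on log-transformed measurements. The function names _linear_fit and classify_growth are hypothetical, and the inputs are assumed to be positive (input length, cost) pairs.

```python
import math


def _linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept; returns (slope, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, sse


def classify_growth(lengths, costs):
    """Pick the better of a polynomial fit and an exponential fit."""
    log_costs = [math.log(c) for c in costs]
    # Polynomial: log(cost) is linear in log(n); the slope estimates the degree.
    degree, sse_poly = _linear_fit([math.log(n) for n in lengths], log_costs)
    # Exponential: log(cost) is linear in n; exp(slope) estimates the base.
    rate, sse_exp = _linear_fit(list(lengths), log_costs)
    if sse_exp < sse_poly:
        return "exponential", round(math.exp(rate), 2)
    return "polynomial", round(degree, 2)


if __name__ == "__main__":
    sizes = [10, 20, 40, 80]
    print(classify_growth(sizes, [n * n for n in sizes]))
```

With real timings the measurements are noisy, so the estimated degree would be rounded to the nearest integer and the comparison between the two residuals treated as a heuristic rather than a proof of the growth class.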
ReDoS Regexes in npm and pypi
  Registry   Total modules   Scanned modules   Unique regexes   SL regexes   Affected modules
  npm        565K            375K (66%)        350K             3.6K (1%)    13K (3%)
  pypi       125K            70K (58%)         60K              700 (1%)     700 (1%)
  More copy/paste in JS? e.g. 2K are regexes taken from Node.js core