Michael D. Ernst, UW CSE. Joint work with Arianna Blasi, Juan Caballero, Sergio Delgado Castellanos, Alberto Goffi, Alessandra Gorla, Xi Victoria Lin, Deric Pang, Mauro Pezzè, and others.
1. Natural language is a programming language
Michael D. Ernst, UW CSE
Joint work with Arianna Blasi, Juan Caballero, Sergio Delgado Castellanos, Alberto Goffi, Alessandra Gorla, Xi Victoria Lin, Deric Pang, Mauro Pezzè, Irfan Ul Haq, Kevin Vu, Chenglong Wang, Luke Zettlemoyer, and Sai Zhang
2. Questions about software
How many of you have used software?
How many of you have written software?
3. What is software?
4. What is software?
A sequence of instructions that perform some task.
5. What is software?
An engineered object amenable to formal analysis.
A sequence of instructions that perform some task.
8. What is software?
A sequence of instructions that perform some task, plus:
test cases, version control history, issue tracker, documentation, …
How should it be analyzed?
9. Programming
Software artifacts: programs, user stories, requirements, specifications, tests, version control, discussions, architecture, process, models, documentation, issue tracker.
Within programs: variable names, output strings, structure, and the programming language (PL) itself.
14. Analysis of a natural object
Machine learning over executions.
Version control history analysis.
Bug prediction.
Upgrade safety.
Prioritizing warnings.
Program repair.
15. Specifications are needed; tests are available but ignored
Specs are needed: many papers start, "Given a program and its specification…"
Tests are ignored. The formal verification process: write the program; test the program; verify the program, ignoring testing artifacts.
Observation: programmers embed semantic information in tests.
Goal: translate tests into specifications.
Approach: machine learning over executions.
16. Dynamic detection of likely invariants [ICSE 1999]
Observe values that the program computes; generalize over them via machine learning.
Result: invariants (as in asserts or specifications), e.g.:
x > abs(y)
x = 16*y + 4*z + 3
array a contains no duplicates
for each node n, n = n.child.parent
graph g is acyclic
Unsound, incomplete, and useful.
https://plse.cs.washington.edu/daikon/
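The observe-and-generalize loop can be sketched as follows. This is a toy illustration of the idea, not Daikon itself; the candidate-invariant grammar and the trace data are invented for the example.

```python
# Toy Daikon-style likely-invariant detection: propose candidate
# invariants over observed variable values, then discard any candidate
# falsified by some observation. Surviving candidates are "likely"
# invariants: unsound (future runs may violate them) but useful.

def detect_invariants(observations):
    """observations: list of dicts mapping variable name -> int value."""
    vars_ = sorted(observations[0])
    # Candidate invariants as (description, predicate) pairs.
    candidates = []
    for v in vars_:
        candidates.append((f"{v} >= 0", lambda obs, v=v: obs[v] >= 0))
    for a in vars_:
        for b in vars_:
            if a != b:
                candidates.append((f"{a} > abs({b})",
                                   lambda obs, a=a, b=b: obs[a] > abs(obs[b])))
                candidates.append((f"{a} == {b}",
                                   lambda obs, a=a, b=b: obs[a] == obs[b]))
    # Keep only candidates that hold for every observation.
    return [desc for desc, pred in candidates
            if all(pred(obs) for obs in observations)]

traces = [{"x": 5, "y": -2}, {"x": 9, "y": 3}, {"x": 4, "y": 0}]
print(detect_invariants(traces))  # → ['x >= 0', 'x > abs(y)']
```

The real system instruments a program to record variable values at procedure entries and exits, and uses a much richer candidate grammar.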
17. Programming
The artifact diagram again: programs, user stories, requirements, specifications, tests, version control, documentation, variable names, output strings, structure, PL, discussions, architecture, process, models, issue tracker.
20. Applying NLP to software engineering
Problems: inadequate diagnostics; incorrect operations; missing tests; unimplemented functionality.
NL sources: error messages; variable names; code; comments; user questions.
NLP techniques: document similarity; word semantics; parse trees; translation.
Goals: analyze existing code; generate new code.
21. Applying NLP to software engineering [ISSTA 2015]
First pairing: inadequate diagnostics, detected from error messages via document similarity.
22. Inadequate diagnostic messages
Scenario: the user supplies a wrong configuration option, e.g. --port_num=100.0
Problem: the software issues an unhelpful error message, such as "unexpected system failure" or "unable to establish connection". These are hard for end users to diagnose.
Goal: detect such problems before shipping the code.
A better message: "--port_num should be an integer"
23. Challenges for proactive detection of inadequate diagnostic messages
How to trigger a configuration error?
How to determine the inadequacy of a diagnostic message?
24. ConfDiagDetector's solutions
How to trigger a configuration error? Configuration mutation plus running the system tests: system tests + a mutated configuration; failed tests ≈ triggered errors. (We know the root cause.)
How to determine the inadequacy of a diagnostic message? Use an NLP technique to check semantic meaning: do the diagnostic messages output by failed tests have similar semantic meaning to the user manual? (Assumption: a manual, webpage, or man page exists.)
25. When is a message adequate?
1. It contains the mutated option name or value [Keller'08, Yin'11].
Example: mutated option --percentage-split; diagnostic message "the value of percentage-split should be > 0".
2. It has similar semantic meaning to the manual description.
Example: mutated option --fnum; diagnostic message "Number of folds must be greater than 1"; user manual description of --fnum: "Sets number of folds for cross-validation".
26. Classical document similarity: TF-IDF + cosine similarity
Convert each document into a real-valued vector; document similarity = vector cosine similarity.
Vector length = dictionary size; values = term frequency (TF).
Example: [2 "classical", 8 "document", 3 "problem", 3 "values", …]
Problem: frequent words swamp important words.
Solution: values = TF × IDF (inverse document frequency), where IDF = log(total documents / documents containing the term).
Problem: does not work well on very short documents.
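A minimal sketch of the TF-IDF + cosine recipe above; the toy documents are invented for the example, not from the talk.

```python
import math
from collections import Counter

def tfidf_vector(doc, corpus):
    """Vector for doc: term -> TF * IDF, with IDF computed over corpus."""
    tf = Counter(doc)
    n = len(corpus)
    return {t: tf[t] * math.log(n / sum(1 for d in corpus if t in d))
            for t in tf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors represented as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values())) or 1.0
    return dot / (norm(u) * norm(v))

corpus = [["port", "number", "must", "be", "an", "integer"],
          ["the", "port", "number", "option"],
          ["unexpected", "system", "failure"]]
vecs = [tfidf_vector(d, corpus) for d in corpus]
# Documents sharing "port"/"number" score higher than unrelated ones:
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # → True
```

Note how the slide's short-document problem shows up here: with only a few words per message, a single shared or missing term swings the score, which motivates the word-similarity technique on the next slide.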
27. Text similarity technique [Mihalcea'06]
Two documents (a diagnostic message and a manual description) have similar semantic meanings if many words in them have similar meanings.
Example: "The program goes wrong" ≈ "The software fails".
Algorithm: remove all stop words; for each word in the diagnostic message, try to find similar words in the manual; two sentences are similar if "many" words are similar between them.
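That matching procedure can be sketched directly. The talk's implementation uses WordNet for word similarity; the tiny stop-word list and synonym table below are invented stand-ins for illustration.

```python
# Toy stand-ins (assumptions for illustration; the real technique uses
# WordNet word similarity, not a hand-written synonym table).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to"}
SYNONYMS = {("program", "software"), ("goes", "fails"), ("wrong", "fails")}

def word_sim(w1, w2):
    """Two words are similar if identical or listed as synonyms."""
    return w1 == w2 or (w1, w2) in SYNONYMS or (w2, w1) in SYNONYMS

def sentences_similar(message, manual, threshold=0.5):
    """True if "many" message words have a similar word in the manual."""
    msg = [w for w in message.lower().split() if w not in STOP_WORDS]
    man = [w for w in manual.lower().split() if w not in STOP_WORDS]
    matched = sum(1 for w in msg if any(word_sim(w, m) for m in man))
    return matched / len(msg) >= threshold

print(sentences_similar("The program goes wrong",
                        "The software fails"))  # → True
```

The threshold encodes the slide's deliberately informal "many": how large a fraction of message words must find a similar manual word.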
28. Results
Reported 25 missing and 18 inadequate messages in Weka, JMeter, Jetty, and Derby.
Validation by 3 programmers:
0% false negative rate (tool says a message is adequate; humans say it is inadequate).
2% false positive rate (tool says a message is inadequate; humans say it is adequate). Previous best: 16%.
29. Related work
Configuration error diagnosis techniques: dynamic tainting [Attariyan'08], static tainting [Rabkin'11], Chronus [Whitaker'04]. These troubleshoot an exhibited error rather than detecting inadequate diagnostic messages.
Software diagnosability improvement techniques: PeerPressure [Wang'04], RangeFixer [Xiong'12], ConfErr [Keller'08] and Spex-INJ [Yin'11], EnCore [Zhang'14]. These require source code, usage history, or OS-level support.
30. Applying NLP to software engineering [WODA 2015]
Second pairing: incorrect operations, detected from variable names via word semantics.
31. Undesired variable interactions
int totalPrice;
int itemPrice;
int shippingDistance;
totalPrice = itemPrice + shippingDistance;
The compiler issues no warning, but a human can tell the abstract types are different.
Idea: cluster variables based on usage in program operations; cluster variables based on words in variable names; differences indicate bugs or poor variable names.
33. Undesired interactions
Variables: distance, itemPrice, tax_rate, miles, shippingFee, percent_complete.
Suspicious operation: itemPrice + distance.
The program types (int, float) don't help; the language of the names indicates the problem.
37. Variables
38. Variable clustering
Cluster based on interactions: operations.
39. Variable clustering
Cluster based on language: variable names.
40. Variable clustering
The two clusterings can disagree. Actual algorithm:
1. Cluster based on operations.
2. Sub-cluster based on names.
3. Rank an operation cluster as suspicious if it contains well-defined name sub-clusters.
41. Clustering based on operations
Abstract type inference [ISSTA 2006]:
int totalCost(int miles, int price, int tax) {
  int year = 2016;
  if ((miles > 1000) && (year > 2000)) {
    int shippingFee = 10;
    return price + tax + shippingFee;
  } else {
    return price + tax;
  }
}
The + operations place price, tax, and shippingFee in one abstract-type cluster; miles and year are only compared against constants.
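The interaction-based clustering can be sketched with union-find: variables combined by an operation fall into one abstract-type cluster. This illustrates the idea, not the ISSTA 2006 algorithm itself; the interaction pairs are read off the totalCost example above.

```python
from collections import defaultdict

class UnionFind:
    """Minimal union-find over hashable items."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def cluster(variables, interactions):
    """Group variables that interact (directly or transitively)."""
    uf = UnionFind()
    for v in variables:
        uf.find(v)            # register even never-combined variables
    for a, b in interactions:
        uf.union(a, b)
    groups = defaultdict(set)
    for v in variables:
        groups[uf.find(v)].add(v)
    return sorted(sorted(g) for g in groups.values())

# From totalCost: "price + tax + shippingFee" combines three variables;
# miles and year are only compared against constants, so stay singletons.
print(cluster(["miles", "price", "tax", "year", "shippingFee"],
              [("price", "tax"), ("tax", "shippingFee")]))
# → [['miles'], ['price', 'shippingFee', 'tax'], ['year']]
```

A real implementation would extract the interaction pairs from the program's operations (assignments, arithmetic, comparisons between two variables) rather than take them as input.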
43. Clustering based on variable names
Compute variable name similarity for var1 and var2:
1. Tokenize each variable into dictionary words, e.g. in_authskey15 ⇒ {"in", "authentications", "key"} (expand abbreviations; best-effort tokenization).
2. Compute word similarity: for all w1 ∈ var1 and w2 ∈ var2, use WordNet (or edit distance).
3. Combine word similarities into variable name similarity:
maxwordsim(w1, var2) = max over w2 ∈ var2 of wordsim(w1, w2)
varsim(var1, var2) = average over w1 ∈ var1 of maxwordsim(w1, var2)
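The varsim formulas translate almost line for line into code. In this sketch, difflib's SequenceMatcher ratio stands in for WordNet word similarity (an assumption for illustration), and the variables are given pre-tokenized.

```python
from difflib import SequenceMatcher

def wordsim(w1, w2):
    # Edit-distance-style similarity: 1.0 for identical words,
    # standing in for WordNet similarity.
    return SequenceMatcher(None, w1, w2).ratio()

def varsim(var1, var2):
    """var1, var2: lists of dictionary words from tokenized names."""
    def maxwordsim(w1, words):
        # Best match for w1 among the other variable's words.
        return max(wordsim(w1, w2) for w2 in words)
    return sum(maxwordsim(w1, var2) for w1 in var1) / len(var1)

# itemPrice is closer in meaning to totalPrice than to shippingDistance:
print(varsim(["item", "price"], ["total", "price"]) >
      varsim(["item", "price"], ["shipping", "distance"]))  # → True
```

Note that varsim is asymmetric (it averages over var1's words only); a symmetric variant could average varsim in both directions.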
44. Results
Ran on grep and the Exim mail server.
The top-ranked mismatch indicates an undesired variable interaction in grep:
if (depth < delta[tree->label])
  delta[tree->label] = depth;
This loses the top 3 bytes of depth. It is not exploitable because of guards elsewhere in the program, but that is not obvious here.
45. Related work
Reusing identifier names is error-prone [Lawrie 2007, Deissenboeck 2010, Arnaoudova 2010].
Identifier naming conventions [Simonyi].
Units of measure [Ada, F#, etc.].
Tokenization of variable names [Lawrie 2010, Guerrouj 2012].
46. Applying NLP to software engineering [ISSTA 2016]
Third pairing: missing tests, addressed from code comments via parse trees.
47. Test oracles (assert statements)
A test consists of:
an input (for a unit test, a sequence of calls)
an oracle (an assert statement)
Programmer-written tests: often trivial oracles, or too few tests.
Automatic generation of tests: inputs are easy to generate; oracles remain an open challenge.
Goal: create test oracles from what programmers already write.
48. Automatic test generation
Code under test:
public class FilterIterator implements Iterator {
  public FilterIterator(Iterator i, Predicate p) {…}
  public Object next() {…}
  …
}
Automatically generated test:
public void test() {
  FilterIterator i = new FilterIterator(null, null);
  i.next();
}
Throws NullPointerException! Did the tool discover a bug? It could be:
1. Expected behavior
2. Illegal input
3. Implementation bug
The constructor's documentation answers the question:
/** @throws NullPointerException if either
 *  the iterator or predicate are null */
49. Automatically generated tests
A test generation tool outputs:
failing tests, which indicate a program bug
passing tests, which are useful for regression testing
Without a specification, the tool guesses whether a given behavior is correct.
False positives: reporting a failing test that was due to illegal inputs.
False negatives: failing to report a failing test because it might have been due to illegal inputs.
50. Programmers write code comments
Javadoc is standard procedure documentation:
/**
 * Checks whether the comparator is now
 * locked against further changes.
 *
 * @throws UnsupportedOperationException
 *         if the comparator is locked
 */
protected void checkLocked() {...}
51. Javadoc comment and assertion
class MyClass {
  ArrayList allFoundSoFar = …;
  boolean canConvert(Object arg) { … }
  /** @throws IllegalArgumentException if the
   *  element is not in the list and is not
   *  convertible. */
  void myMethod(Object element) { … }
}
Condition for the exception: myMethod should throw iff
!allFoundSoFar.contains(element) && !canConvert(element)
52. Nouns = objects, verbs = operations
Parse-tree figure: "The element is greater than the current maximum."
Nouns map to objects: "element" ⇒ elt; "the current maximum" ⇒ currentMax.
The verb phrase maps to an operation: "is greater than" ⇒ compareTo() > 0.
Result: elt.compareTo(currentMax) > 0
53. Text to code: the Toradocu algorithm
1. Parse @param, @return, and @throws expressions using the Stanford Parser, yielding a parse tree, grammatical relations, and cross-references.
Challenges: often not a well-formed sentence; code snippets appear as nouns/verbs; referents are implicit; understanding assumes coding knowledge.
2. Match each subject to a Java element, via pattern matching and lexical similarity to identifiers, types, and documentation.
3. Match each predicate to a Java element.
4. Create an assert statement from the expressions and methods.
54. Results
Accuracy on 857 Javadoc tags: 97% precision, 72% recall (parameters can be tuned to favor either metric).
Pre-processing and pattern-matching are important.
Discovered specification errors.
Improved test generation tools: reduced false-positive test failures in EvoSuite by ≥ 1/3; also improved Randoop, but by less.
55. Related work
Heuristics: JCrasher, Check 'n' Crash [Csallner'04, Csallner'05]; Randoop [Pacheco'07].
Specifications: ASTOOT [Doong'94]; models, contracts, ….
Properties: cross-checking oracles [Carzaniga'14]; metamorphic testing [Chen'13]; symmetric testing [Gotlieb'03].
Natural language documentation: iComment, aComment, @tComment [Tan'07, Tan'11, Tan'12].
56. Applying NLP to software engineering
Fourth pairing: unimplemented functionality, addressed from user questions via translation.
57. Machine translation
English: "My hovercraft is full of eels." → Spanish: "Mi aerodeslizador está lleno de anguilas."
English: "Don't worry." → Spanish: "No te preocupes."
58. Sequence-to-sequence recurrent neural network translators
Figure: an encoder-decoder network (input layer, hidden layer, output layer, attention mechanism) translating "My hovercraft is full of eels." word by word into "Mi aerodeslizador …".
The input, hidden, and output functions are inferred from training data using probability maximization.
59. Tellina: text to commands
Training data: ~5000 ⟨text, command⟩ pairs, collected manually from webpages, plus cleaning.
Scope: 17 file system utilities, > 200 flags, 9 types of constants.
Compound commands: (), &&, ||. Nesting: |, $(), <().
Strings are opaque; no command interpreters (awk, sed); no bash compound statements (for).
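Tellina's evaluation separately scores the structure of a command (constants abstracted away) and the full command. That abstraction can be sketched with regexes; the example commands and the placeholder names below are invented for illustration.

```python
import re

def structure(cmd):
    """Abstract a command's constants so commands compare structurally.
    (Illustrative sketch; a real implementation would use the bash
    parse tree, not regexes.)"""
    cmd = re.sub(r'"[^"]*"', 'STRING', cmd)         # quoted strings
    cmd = re.sub(r'(?<=\s)-?\d+\b', 'NUMBER', cmd)  # bare numeric constants
    return cmd

predicted = 'find . -name "*.md" -mtime -7'
reference = 'find . -name "*.txt" -mtime -30'
print(structure(predicted))  # find . -name STRING -mtime NUMBER
# Different constants, same structure:
print(structure(predicted) == structure(reference))  # → True
```

This distinction explains the accuracy gap in the results: a prediction with the right utilities and flags but wrong constants counts for the 69% structure metric yet not for the 30% full-command metric.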
60. Results
Accuracy for Tellina's first output: structure of the command (without constants), 69%; full command (with constants), 30%.
User experiment: Tellina makes users 22% more efficient, even though it rarely gives a perfect command.
Qualitative feedback:
Most participants wanted to continue using Tellina (5.8/7 on a Likert scale).
Partially correct answers were helpful and not too hard to correct.
Output bash commands are sometimes non-syntactic or subtly wrong.
Users need an explanation of the meaning of the output bash commands.
61. Related work
Neural machine translation: sequence-to-sequence learning with neural nets [Sutskever 2014]; attention mechanism [Luong 2015].
Semantic parsing: translating natural language to a formal representation [Zettlemoyer 2007, Pasupat 2016].
Translating natural language to DSLs: if-this-then-that recipes [Quirk 2015]; regular expressions [Locascio 2016]; text editing, flight queries [Desai 2016].
62. Other software engineering projects
Analyzing programs before they are written.
Gamification (crowd-sourcing) of verification.
Evaluating and improving fault localization.
Pluggable type-checking for error prevention.
… many more: systems, synthesis, verification, etc.
UW is hiring! Faculty, postdocs, grad students.
63. Applying NLP to software engineering (recap)
Problems: inadequate diagnostics; incorrect operations; missing tests; unimplemented functionality.
NL sources: error messages; variable names; code; comments; user questions.
NLP techniques: document similarity; word semantics; parse trees; translation.
64. Programming
The artifact diagram again: programs, user stories, requirements, specifications, tests, version control, documentation, output strings, variable names, discussions, architecture, process, models, structure, PL, issue tracker.
65. Analyzing text
iComment [Tan 2007]: pattern matching for null.
N-gram models: code completion [Hindle 2011]; predicting variable names and whitespace [Allamanis 2014].
Mining variable names [Pollock et al.].
Code comments [Sridhara 2010].
DARPA Big Mechanism (reading cancer papers).
JSNice [Raychev 2015]: learned rules for identifiers and types.
66. Analyzing other artifacts by machine learning over the program
Tests (dynamic invariant detection).
Mining software repositories.
Defect prediction.
Code completion.
Clone detection.
… many, many more.
67. Machine learning + software engineering
Software is more than source code.
Formal program analysis is useful, but insufficient.
Analyze and generate all software artifacts.
A rich space for further exploration.
68. Programming
Closing recap of the artifact diagram: programs, user stories, requirements, specifications, tests, version control, documentation, output strings, variable names, discussions, architecture, process, models, structure, PL, issue tracker.