Slide1
Efficiently Mining Source Code with Boa
Robert Dyer
The research activities described in this talk were supported in part by the US National Science Foundation (NSF) grants CCF-13-49153, CCF-13-20578, TWC-12-23828, CCF-11-17937, CCF-10-17334, and CCF-10-18600.
Tien N. Nguyen
Hridesh Rajan
Hoan Anh Nguyen
Slide2
What do I mean by software repository?
Slide3
Slide4
What features do they have?
Slide5
What do I mean by mining software repositories (MSR)?
Slide6
Slide7
What are some examples of software repository mining?
Slide8
What is the most used programming language?
Slide9
How many words are in commit messages?
Words[] = update, 30715
Words[] = cleanup, 19073
Words[] = updated, 18737
Words[] = refactoring, 11981
Words[] = fix, 11705
Words[] = test, 9428
Words[] = typo, 9288
Words[] = updates, 7746
Words[] = javadoc, 6893
Words[] = bugfix, 6295
Slide10
How has unit testing evolved over time?
[Chart annotation: JUnit 4 release]
Slide11
What makes this ultra-large-scale mining?
Slide12
Previous examples queried...
Projects: 699,331
Code Repositories: 494,158
Revisions: 15,063,073
Unique Files: 69,863,970
File Snapshots: 147,074,540
AST Nodes: 18,651,043,23
Over 250GB of pre-processed data
Slide13
What does bringing BIG DATA to the masses mean?
Slide14
How has unit testing evolved over time?
How can we solve this task?
Slide15
[Flowchart: foreach project, mine project metadata. Has repository? If yes, access the repository and mine revisions. Find all source files and mine sources. Find all methods. Method has @Test? If yes, add to Results.]
Slide16
[Flowchart repeated from Slide15]
Challenge: Volume
Slide17
Challenge: Volume
Projects: 699,331
Code Repositories: 494,158
Revisions: 15,063,073
Unique Files: 69,863,970
File Snapshots: 147,074,540
AST Nodes: 18,651,043,23
How do you:
Find such a large dataset?
Access this data?
Store the data?
Transform the data for analysis?
Efficiently analyze the data?
Slide18
[Flowchart repeated from Slide15]
Challenge: Velocity
Slide19
Challenge: Velocity
Slide20
Challenge: Velocity
Slide21
[Flowchart repeated from Slide15]
Challenge: Variety
Slide22
Challenge: Variety
Slide23
Ultra-large-scale Software Repository Mining: The Boa Experience
[ICSE'14]
[ICSE'13] [GPCE'13] [SPLASH'13 SRC] [TOSEM] (under review)
Slide24
Boa's Architecture
[Architecture diagram]
Boa's Data Infrastructure: Replicate and Transform; cache; Stored on cluster
Boa's Computing Infrastructure: User submits query; Compiled into Hadoop program; Deployed and executed on cluster; Query result returned via web
Slide25
[Flowchart repeated from Slide15]
Challenge: Volume
Challenge: Velocity
Challenge: Variety
Slide26
A better solution...
Tests: output sum[timestamp] of int;
cur_time: timestamp;
visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});
Automatically parallelized
Analyzes 18 billion AST nodes in minutes
Only 10 lines of code
No external libraries
Slide27
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
Slide28
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
visit(input, visitor {
});
Slide29
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
visit(input, visitor {
    before n: Modifier ->
});
Slide30
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
visit(input, visitor {
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
});
Slide31
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
visit(input, visitor {
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
});
Slide32
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
visit(input, visitor {
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});
Slide33
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
cur_time: timestamp;
visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});
Slide34
Tests: output sum[timestamp] of int;
cur_time: timestamp;
visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});
[Animated diagram: one copy of the Boa program runs for each project in the Dataset (input = project 1, input = project 2, input = project 3, ..., input = project n).]
[Animated diagram: each copy sends values to the output variable, e.g. Tests[631152000] << 1 and Tests[631154020] << 1, producing the pairs (631152000, 1), (631154020, 1), (631154020, 1), (631161103, 1), which are combined into the Output.]
Output
Tests[631152000] = 5
Tests[631154020] = 12
Tests[631161103] = 14
Tests[631172392] = 18
...
Slide35
Automatic Parallelization
Tests: output sum[timestamp] of int;
cur_time: timestamp;
visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});
Output variables with built-in aggregator functions: sum, mean, top(k), bottom(k), set, collection, etc.
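To make the role of an output variable with a sum aggregator concrete, here is a minimal Python sketch. It is not Boa and not the code the Boa compiler generates: it is a single-process stand-in for the emit-then-aggregate idea. The names `PROJECTS`, `run_on_project`, `sum_aggregator`, and `run_query` are hypothetical, and the sample data is invented for illustration.

```python
import re
from collections import defaultdict

# Same pattern as the Boa query: "Test" or "org.junit.Test".
JUNIT_TEST = re.compile(r"^(org\.junit\.)?Test$")

# Invented sample data (an assumption, not Boa's real input format):
# each project is a list of (commit_timestamp, annotation_names) revisions.
PROJECTS = [
    [(631152000, ["Test", "Override"]), (631154020, ["org.junit.Test"])],
    [(631152000, ["Test"]), (631161103, ["Deprecated"])],
]


def run_on_project(project):
    """Emit one (timestamp, 1) pair per matching annotation in one project."""
    emitted = []
    for commit_date, annotation_names in project:
        for name in annotation_names:
            if JUNIT_TEST.search(name):
                emitted.append((commit_date, 1))  # like Tests[cur_time] << 1
    return emitted


def sum_aggregator(emitted_pairs):
    """Add up all values that were emitted under the same index."""
    totals = defaultdict(int)
    for key, value in emitted_pairs:
        totals[key] += value
    return dict(totals)


def run_query(projects):
    """Run the per-project step on every project, then aggregate once."""
    pairs = []
    for project in projects:  # each project is processed independently
        pairs.extend(run_on_project(project))
    return sum_aggregator(pairs)


if __name__ == "__main__":
    for timestamp, count in sorted(run_query(PROJECTS).items()):
        print(f"Tests[{timestamp}] = {count}")
```

Because each project is handled independently and the aggregator only needs the emitted pairs, the per-project step can be distributed across machines, which is the property the slide attributes to Boa's output variables.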
Compiler generates Hadoop MapReduce code
Slide36
Abstracting MSR with Types
Tests: output sum[timestamp] of int;
cur_time: timestamp;
visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});
Custom domain-specific types for mining software repositories
5 base types and 9 types for source code
No need to understand multiple data formats or APIs
Slide37
Abstracting MSR with Types
[Class diagram of the base types]
Project 1 -- 1..* CodeRepository
CodeRepository 1 -- * Revision
Revision 1 -- * ChangedFile
ChangedFile 1 -- 0..1 ASTRoot
Slide38
Abstracting MSR with Types
[Class diagram of the source code types: ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression, linked by containment relationships with multiplicities]
Slide39
Challenge: How can we make mining source code easier?
Answer: Declarative Visitors
Slide40
Background: Visitor Pattern
Without visitors:
Rectangle: draw(Graphics g), scale(int x, int y)
Circle: draw(Graphics g), scale(int x, int y)
Triangle: draw(Graphics g), scale(int x, int y)
With visitors:
Rectangle: accept(Visitor v)
Circle: accept(Visitor v)
Triangle: accept(Visitor v)
DrawVisitor: visit(Rectangle r), visit(Circle c), visit(Triangle t)
ScaleVisitor: visit(Rectangle r), visit(Circle c), visit(Triangle t)
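The visitor variant of the shapes example can be sketched in Python as follows. This is only an illustration of the general pattern, assuming simple integer dimensions. The slide's `Graphics` parameter is replaced by a returned text description, the overloaded `visit(...)` methods become `visit_rectangle`, `visit_circle`, and `visit_triangle` (Python has no overloading), and scaling a circle by `x` only is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Rectangle:
    width: int
    height: int

    def accept(self, visitor):
        # Each shape only knows how to dispatch to the matching visit method.
        return visitor.visit_rectangle(self)


@dataclass
class Circle:
    radius: int

    def accept(self, visitor):
        return visitor.visit_circle(self)


@dataclass
class Triangle:
    base: int
    height: int

    def accept(self, visitor):
        return visitor.visit_triangle(self)


class DrawVisitor:
    """Describes each shape as text (stand-in for drawing on a Graphics object)."""

    def visit_rectangle(self, r):
        return f"rectangle {r.width}x{r.height}"

    def visit_circle(self, c):
        return f"circle r={c.radius}"

    def visit_triangle(self, t):
        return f"triangle base={t.base} height={t.height}"


class ScaleVisitor:
    """Scales each shape in place by the factors x and y."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def visit_rectangle(self, r):
        r.width *= self.x
        r.height *= self.y

    def visit_circle(self, c):
        c.radius *= self.x  # assumption: circles scale uniformly by x

    def visit_triangle(self, t):
        t.base *= self.x
        t.height *= self.y


if __name__ == "__main__":
    shapes = [Rectangle(2, 3), Circle(1), Triangle(4, 5)]
    for shape in shapes:
        shape.accept(ScaleVisitor(2, 3))
    print([shape.accept(DrawVisitor()) for shape in shapes])
```

Adding a new operation means adding one more visitor class, while the shape classes stay unchanged. That separation is what the later slides carry over to source code mining.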
Slide41
Easing Source Code Mining with Visitors
id := visitor {
    before T -> statement;
    after T -> statement;
};
visit(node, id);
Slide42
Easing Source Code Mining with Visitors
id := visitor {
    before id : T1 -> statement;
    before T2, T3 -> statement;
    before _ -> statement;
};
Slide43
Easing Source Code Mining with Visitors
[Diagram, shown twice: the node types ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression]
Slide44
Easing Source Code Mining with Visitors
[Diagram of node types in two groups: Method, Type, Statement, Expression; and ASTRoot, Namespace, Declaration, Variable]
before n: Declaration -> {
}
before n: Declaration -> {
    foreach (i: int; n.fields[i])
        visit(n.fields[i]);
}
before n: Declaration -> {
    foreach (i: int; n.fields[i])
        visit(n.fields[i]);
    stop;
}
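The effect of visiting selected children by hand and then issuing `stop` can be imitated with a small Python tree walker. This is a sketch of the idea only, not Boa's implementation. `Node`, `Stop`, `visit`, and `field_names` are hypothetical names, the tree is invented, and running the `after` handler even when traversal was stopped is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # e.g. "Declaration", "Method", "Variable"
    name: str = ""
    children: list = field(default_factory=list)


class Stop(Exception):
    """Raised by a before-handler to skip the default traversal of children."""


def visit(node, before, after=None):
    """Depth-first walk with per-kind before/after handlers ("_" is a wildcard)."""
    after = after or {}
    descend = True
    handler = before.get(node.kind, before.get("_"))
    if handler is not None:
        try:
            handler(node)
        except Stop:
            descend = False        # the handler took over traversal for this node
    if descend:
        for child in node.children:
            visit(child, before, after)
    handler = after.get(node.kind, after.get("_"))
    if handler is not None:
        handler(node)


def field_names(root):
    """Collect only the variables directly inside declarations."""
    seen = []

    def on_variable(n):
        seen.append(n.name)

    def on_declaration(n):
        # Visit the fields by hand, then stop so methods are never walked.
        for child in n.children:
            if child.kind == "Variable":
                visit(child, before)
        raise Stop

    before = {"Variable": on_variable, "Declaration": on_declaration}
    visit(root, before)
    return seen


TREE = Node("Declaration", "C", [
    Node("Variable", "f"),
    Node("Method", "m", [Node("Variable", "local")]),
])

if __name__ == "__main__":
    print(field_names(TREE))
```

With the `raise Stop` line removed, the default traversal would run as well, so the field `f` would be reported twice and the local variable inside the method would also be visited. That matches the difference between the second and third snippets on the slide.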
Slide45
Let's see it in action!
http://boa.cs.iastate.edu/boa/
Slide46
Summary
Ultra-large-scale software repository mining poses several challenges
Boa provides abstractions to address these challenges
Automatically parallelizes queries
Domain-specific language, types, and functions to make mining software repositories easier
Ultra-large-scale dataset with almost 700k projects
Slide47
Boa's Global Impact
90+ users from over 20 countries!
Slide48
Thank you!
http://boa.cs.iastate.edu/