Slide1
Efficiently Mining Source Code with Boa
Robert Dyer
The research activities described in this talk were supported in part by the US National Science Foundation (NSF) grants CCF-13-49153, CCF-13-20578, TWC-12-23828, CCF-11-17937, CCF-10-17334, and CCF-10-18600.
Tien N. Nguyen
Hridesh Rajan
Hoan Anh Nguyen
Slide2
What do I mean by software repository?
Slide3
Slide4
What features do they have?
Slide5
What do I mean by mining software repositories (MSR)?
Slide6
Slide7
What are some examples of software repository mining?
Slide8
What is the most used programming language?
Slide9
How many words are in commit messages?
Words[] = update, 30715
Words[] = cleanup, 19073
Words[] = updated, 18737
Words[] = refactoring, 11981
Words[] = fix, 11705
Words[] = test, 9428
Words[] = typo, 9288
Words[] = updates, 7746
Words[] = javadoc, 6893
Words[] = bugfix, 6295
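The listing above has the shape of a top-k count over commit-message words. A minimal Python sketch of that counting step follows. The function name `top_words` and the commit messages are invented for illustration; they are not Boa code or Boa data.

```python
from collections import Counter
import re

def top_words(messages, k):
    """Return the k most frequent lowercase words across all messages."""
    counts = Counter()
    for message in messages:
        # Split each message into runs of ASCII letters, ignoring case.
        counts.update(re.findall(r"[a-z]+", message.lower()))
    return counts.most_common(k)

# Hypothetical commit messages, invented for this sketch.
messages = ["Fix typo", "fix test", "Update javadoc", "fix build"]
for word, count in top_words(messages, 1):
    print(f"Words[] = {word}, {count}")
```

Ranking by count and keeping the first k entries is all the sketch does; on a real dataset the counting would have to be distributed across many machines.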
Slide10
How has unit testing evolved over time?
JUnit 4 release
Slide11
What makes this ultra-large-scale mining?
Slide12
Previous examples queried...
Projects: 699,331
Code Repositories: 494,158
Revisions: 15,063,073
Unique Files: 69,863,970
File Snapshots: 147,074,540
AST Nodes: 18,651,043,23
Over 250GB of pre-processed data
Slide13
What does bringing BIGDATA to the masses mean?
Slide14
How has unit testing evolved over time?
How can we solve this task?
Slide15
[Flowchart: foreach project: mine project metadata -> Has repository? (yes) -> Access repository -> mine revisions -> Find all source files -> mine sources -> Find all methods -> Method has @Test? (yes) -> Results]
Slide16
[Flowchart repeated from Slide15]
Challenge: Volume
Slide17
Challenge: Volume
Projects: 699,331
Code Repositories: 494,158
Revisions: 15,063,073
Unique Files: 69,863,970
File Snapshots: 147,074,540
AST Nodes: 18,651,043,23
How do you:
Find such a large dataset?
Access this data?
Store the data?
Transform the data for analysis?
Efficiently analyze the data?
Slide18
[Flowchart repeated from Slide15]
Challenge: Velocity
Slide19
Challenge: Velocity
Slide20
Challenge: Velocity
Slide21
[Flowchart repeated from Slide15]
Challenge: Variety
Slide22
Challenge: Variety
Slide23
Ultra-large-scale Software Repository Mining
The Boa Experience
[ICSE'14]
[ICSE'13]
[GPCE'13]
[SPLASH'13 SRC]
[TOSEM] (under review)
Slide24
Boa's Architecture
Boa's Data Infrastructure: Replicate and Transform, cache, Stored on cluster
Boa's Computing Infrastructure: User submits query -> Compiled into Hadoop program -> Deployed and executed on cluster -> Query result returned via web
Slide25
[Flowchart repeated from Slide15]
Challenge: Volume
Challenge: Velocity
Challenge: Variety
Slide26
A better solution...
Tests: output sum[timestamp] of int;
cur_time: timestamp;
visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
                match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});
Automatically parallelized
Analyzes 18 billion AST nodes in minutes
Only 10 lines of code
No external libraries
Slide27
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
Slide28
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
visit(input, visitor {
});
Slide29
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
visit(input, visitor {
    before n: Modifier ->
});
Slide30
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
visit(input, visitor {
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
});
Slide31
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
visit(input, visitor {
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
                match(`^(org\.junit\.)?Test$`, n.annotation_name))
});
Slide32
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
visit(input, visitor {
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
                match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});
Slide33
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
cur_time: timestamp;
visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
                match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});
Slide34
Boa Program
Tests: output sum[timestamp] of int;
cur_time: timestamp;
visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
                match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});
Dataset
input = project1
input = project2
input = project3
...
input = projectn
Boa Program, Boa Program, Boa Program, Boa Program, ...
Tests[631152000] << 1;
Tests[631154020] << 1;
631152000, 1
631154020, 1
631152000, 1
631154020, 1
631154020, 1
631161103, 1
Output
Tests
Tests[631152000] = 5
Tests[631154020] = 12
Tests[631161103] = 14
Tests[631172392] = 18
. . .
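The data flow on this slide can be sketched in Python: the same program runs once per input project, each run emits (timestamp, 1) pairs, and a sum aggregator combines the pairs that share a timestamp. The function names, the toy data model, and the dataset are assumptions made for illustration; this is not the code Boa generates.

```python
from collections import defaultdict

def run_program(project):
    """Emulate one run of the query on one project (hypothetical data model).

    A project is a list of (commit_timestamp, annotation_names) revisions.
    One (timestamp, 1) pair is emitted per annotation named Test or
    org.junit.Test, mirroring the regular expression in the Boa program.
    """
    for timestamp, annotations in project:
        for name in annotations:
            if name in ("Test", "org.junit.Test"):
                yield timestamp, 1

def sum_aggregator(pairs):
    """Sum the emitted values per key, in the spirit of `sum[timestamp] of int`."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Toy dataset of two projects, invented for this sketch.
dataset = [
    [(631152000, ["Test", "Override"]), (631154020, ["Test"])],
    [(631152000, ["org.junit.Test"]), (631154020, ["Test", "Test"])],
]

# Each project is processed independently, so the runs could execute in parallel.
emitted = [pair for project in dataset for pair in run_program(project)]
tests = sum_aggregator(emitted)
for timestamp in sorted(tests):
    print(f"Tests[{timestamp}] = {tests[timestamp]}")
```

Because no run depends on another project's data, the per-project step can be distributed freely; only the aggregation step needs to see all emitted pairs.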
Slide35
Automatic Parallelization
[Boa program repeated from Slide26]
Output variables with built-in aggregator functions: sum, mean, top(k), bottom(k), set, collection, etc.
Compiler generates Hadoop MapReduce code
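To make the aggregator names concrete, here is a small Python sketch of what each one might compute over a stream of emitted values. The semantics are simplified assumptions: in particular, `top` and `bottom` here rank items by the total of their emitted weights, and none of the function names are Boa APIs.

```python
from collections import defaultdict

def agg_sum(values):
    """Total of all emitted values."""
    return sum(values)

def agg_mean(values):
    """Arithmetic mean of all emitted values."""
    return sum(values) / len(values)

def agg_set(values):
    """Distinct emitted values."""
    return set(values)

def agg_collection(values):
    """Every emitted value, in emission order."""
    return list(values)

def _weight_totals(weighted):
    # Add up the weights emitted for each item.
    totals = defaultdict(int)
    for item, weight in weighted:
        totals[item] += weight
    return totals

def agg_top(weighted, k):
    """The k items with the largest total weight (ties broken by item name)."""
    totals = _weight_totals(weighted)
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

def agg_bottom(weighted, k):
    """The k items with the smallest total weight (ties broken by item name)."""
    totals = _weight_totals(weighted)
    return sorted(totals.items(), key=lambda kv: (kv[1], kv[0]))[:k]

# Values invented for this sketch.
emitted = [3, 1, 4, 1, 5]
weighted = [("fix", 2), ("typo", 1), ("fix", 1), ("test", 5)]
print(agg_sum(emitted), agg_mean(emitted), sorted(agg_set(emitted)))
print(agg_top(weighted, 2), agg_bottom(weighted, 1))
```

Each aggregator only needs the emitted values, not the program that produced them, which is what lets the emitting side be parallelized independently.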
Slide36
Abstracting MSR with Types
[Boa program repeated from Slide26]
Custom domain-specific types for mining software repositories
5 base types and 9 types for source code
No need to understand multiple data formats or APIs
Slide37
Abstracting MSR with Types
Project 1 -- 1..* CodeRepository
CodeRepository 1 -- * Revision
Revision 1 -- * ChangedFile
ChangedFile 1 -- 0..1 ASTRoot
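One way to picture this containment chain is as nested records. The Python sketch below models the five types as dataclasses, reading the multiplicities as Project to CodeRepository (1..*), CodeRepository to Revision (*), Revision to ChangedFile (*), and ChangedFile to ASTRoot (0..1). Only the type names and `commit_date` come from the slides; every other field name, the helper function, and the sample data (including the example.org URL) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ASTRoot:
    namespaces: List[str] = field(default_factory=list)  # simplified: names only

@dataclass
class ChangedFile:
    name: str
    ast: Optional[ASTRoot] = None  # 0..1: a file may have no AST

@dataclass
class Revision:
    commit_date: int
    files: List[ChangedFile] = field(default_factory=list)

@dataclass
class CodeRepository:
    url: str
    revisions: List[Revision] = field(default_factory=list)

@dataclass
class Project:
    name: str
    code_repositories: List[CodeRepository] = field(default_factory=list)

def count_parsed_files(project):
    """Count changed files, across all revisions, that carry an AST."""
    return sum(
        1
        for repo in project.code_repositories
        for revision in repo.revisions
        for changed in revision.files
        if changed.ast is not None
    )

# Sample project, invented for this sketch.
project = Project("demo", [CodeRepository("https://example.org/demo.git", [
    Revision(631152000, [ChangedFile("A.java", ASTRoot(["demo"])), ChangedFile("README")]),
    Revision(631154020, [ChangedFile("ATest.java", ASTRoot(["demo"]))]),
])])
print(count_parsed_files(project))
```

With every level exposed as a typed field, a query walks one uniform structure instead of dealing with each version control system's own format.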
Slide38
Abstracting MSR with Types
[Class diagram of the source code types: ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression, linked by containment relationships with multiplicities 1, * and 1..*]
Slide39
Challenge: How can we make mining source code easier?
Answer: Declarative Visitors
Slide40
Background: Visitor Pattern
Without visitors:
Rectangle: draw(Graphics g), scale(int x, int y)
Circle: draw(Graphics g), scale(int x, int y)
Triangle: draw(Graphics g), scale(int x, int y)
With visitors:
Rectangle: accept(Visitor v)
Circle: accept(Visitor v)
Triangle: accept(Visitor v)
DrawVisitor: visit(Rectangle r), visit(Circle c), visit(Triangle t)
ScaleVisitor: visit(Rectangle r), visit(Circle c), visit(Triangle t)
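The refactoring on this slide can be sketched in Python. Python has no method overloading, so the sketch uses one `visit_*` method per shape instead of overloaded `visit` methods; the attribute names, the text-based "drawing", and the choice to scale a circle's radius by `x` only are assumptions made for illustration.

```python
class Rectangle:
    def __init__(self, width, height):
        self.width, self.height = width, height

    def accept(self, visitor):
        # Double dispatch: the shape picks the visitor method for its own type.
        return visitor.visit_rectangle(self)

class Circle:
    def __init__(self, radius):
        self.radius = radius

    def accept(self, visitor):
        return visitor.visit_circle(self)

class Triangle:
    def __init__(self, base, height):
        self.base, self.height = base, height

    def accept(self, visitor):
        return visitor.visit_triangle(self)

class DrawVisitor:
    """Describes each shape as text instead of drawing on a Graphics object."""
    def visit_rectangle(self, r):
        return f"Rectangle {r.width}x{r.height}"

    def visit_circle(self, c):
        return f"Circle r={c.radius}"

    def visit_triangle(self, t):
        return f"Triangle {t.base}x{t.height}"

class ScaleVisitor:
    """Scales each shape in place by the factors x and y."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def visit_rectangle(self, r):
        r.width *= self.x
        r.height *= self.y

    def visit_circle(self, c):
        c.radius *= self.x  # assumption: a circle uses the x factor only

    def visit_triangle(self, t):
        t.base *= self.x
        t.height *= self.y

shapes = [Rectangle(1, 2), Circle(3), Triangle(4, 5)]
for shape in shapes:
    shape.accept(ScaleVisitor(2, 3))
drawn = [shape.accept(DrawVisitor()) for shape in shapes]
print(drawn)
```

A new operation now means a new visitor class, while the shape classes stay untouched; the trade-off is that adding a new shape requires touching every visitor.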
Slide41
Easing Source Code Mining with Visitors
id := visitor {
    before T -> statement;
    after T -> statement;
};
visit(node, id);
Slide42
Easing Source Code Mining with Visitors
id := visitor {
    before id : T1 -> statement;
    before T2, T3 -> statement;
    before _ -> statement;
};
Slide43
Easing Source Code Mining with Visitors
[Diagram of the AST types: ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression]
Slide44
Easing Source Code Mining with Visitors
[Diagram of the AST types, in two groups: Method, Type, Statement, Expression; ASTRoot, Namespace, Declaration, Variable]
before n: Declaration -> {}
before n: Declaration -> {
    foreach (i: int; n.fields[i])
        visit(n.fields[i]);
}
before n: Declaration -> {
    foreach (i: int; n.fields[i])
        visit(n.fields[i]);
    stop;
}
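The effect of overriding the default traversal can be sketched in Python. The sketch assumes a depth-first traversal that runs a `before` handler, then visits the children, then runs an `after` handler, and it treats a `STOP` return value as "skip the default traversal of this node's children". The `Node` class, the handler dictionaries, and the decision that `STOP` also skips the `after` handler are simplifications for illustration, not a statement of Boa's exact semantics.

```python
class Node:
    def __init__(self, kind, name="", children=()):
        self.kind, self.name, self.children = kind, name, list(children)

STOP = object()  # sentinel a before-handler returns to skip the default traversal

def visit(node, before=None, after=None):
    """Depth-first traversal with per-kind before/after handlers."""
    before = before or {}
    after = after or {}
    handler = before.get(node.kind)
    if handler is not None and handler(node) is STOP:
        return
    for child in node.children:
        visit(child, before, after)
    handler = after.get(node.kind)
    if handler is not None:
        handler(node)

def collect_all_variables(root):
    """Default traversal: every Variable in the tree is reached."""
    seen = []
    visit(root, before={"Variable": lambda n: seen.append(n.name)})
    return seen

def collect_fields_only(root):
    """Custom traversal: visit a Declaration's fields by hand, then stop."""
    seen = []
    before = {"Variable": lambda n: seen.append(n.name)}

    def on_declaration(n):
        for child in n.children:
            if child.kind == "Variable":  # stands in for n.fields
                visit(child, before)
        return STOP  # do not descend into methods or nested nodes

    before["Declaration"] = on_declaration
    visit(root, before)
    return seen

# Toy AST, invented for this sketch.
root = Node("Declaration", "A", [
    Node("Variable", "f1"),
    Node("Method", "m1", [Node("Variable", "local")]),
    Node("Variable", "f2"),
])
print(collect_all_variables(root))
print(collect_fields_only(root))
```

The second function reaches only the fields of the declaration, because the handler replaces the default descent with an explicit one and then stops.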
Slide45
Let's see it in action!
http://boa.cs.iastate.edu/boa/
Slide46
Summary
Ultra-large-scale software repository mining poses several challenges
Boa provides abstractions to address these challenges
Automatically parallelizes queries
Domain-specific language, types, and functions to make mining software repositories easier
Ultra-large-scale dataset with almost 700k projects
Slide47
Boa's Global Impact
90+ users from over 20 countries!
Slide48
Thank you!
http://boa.cs.iastate.edu/