
Efficiently Mining Source Code - PowerPoint Presentation

Uploaded by karlyn-bohler on 2017-07-13

Presentation Transcript

Slide1

Efficiently Mining Source Code with Boa

Robert Dyer

The research activities described in this talk were supported in part by the US National Science Foundation (NSF) grants CCF-13-49153, CCF-13-20578, TWC-12-23828, CCF-11-17937, CCF-10-17334, and CCF-10-18600.

Tien N. Nguyen

Hridesh Rajan

Hoan Anh Nguyen

Slide2

What do I mean by software repository?

Slide3

Slide4

What features do they have?

Slide5

What do I mean by mining software repositories (MSR)?

Slide6

Slide7

What are some examples of software repository mining?

Slide8

What is the most used programming language?

Slide9

How many words are in commit messages?
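A result of this shape can come from a simple word-frequency count over commit messages. The sketch below is a Python analogy, not the Boa query used in the talk; the function name `top_words` and the sample messages are invented for illustration.

```python
import re
from collections import Counter


def top_words(messages, k):
    """Return the k most frequent lowercase words across all commit messages."""
    counts = Counter()
    for message in messages:
        # Split each message into alphabetic words, ignoring case.
        counts.update(re.findall(r"[a-z]+", message.lower()))
    return counts.most_common(k)


# Hypothetical commit messages, invented for this example only.
sample_messages = [
    "Update README",
    "update build script",
    "Fix typo",
    "update tests",
    "fix NPE in parser",
]

print(top_words(sample_messages, 2))
```

On a real dataset the same counting would have to run over millions of revisions, which is where the scale questions raised later in the talk come in.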

Words[] = update, 30715

Words[] = cleanup, 19073

Words[] = updated, 18737

Words[] = refactoring, 11981

Words[] = fix, 11705

Words[] = test, 9428

Words[] = typo, 9288

Words[] = updates, 7746

Words[] = javadoc, 6893

Words[] = bugfix, 6295

Slide10

How has unit testing evolved over time?

[Chart annotation: JUnit 4 release]

Slide11

What makes this ultra-large-scale mining?

Slide12

Previous examples queried...

Projects: 699,331

Code Repositories: 494,158

Revisions: 15,063,073

Unique Files: 69,863,970

File Snapshots: 147,074,540

AST Nodes: 18,651,043,23

Over 250GB of pre-processed data

Slide13

What does bringing BIGDATA to the masses mean?

Slide14

How has unit testing evolved over time?

How can we solve this task?

Slide15

Workflow: foreach project -> mine project metadata -> Has repository? (yes) -> Access repository -> mine revisions -> Find all source files -> mine sources -> Find all methods -> Method has @Test? (yes) -> Results

Slide16

[Workflow diagram repeated from slide 15]

Challenge: Volume

Slide17

Challenge: Volume

Projects: 699,331

Code Repositories: 494,158

Revisions: 15,063,073

Unique Files: 69,863,970

File Snapshots: 147,074,540

AST Nodes: 18,651,043,23

How do you:

Find such a large dataset?

Access this data?

Store the data?

Transform the data for analysis?

Efficiently analyze the data?

Slide18

[Workflow diagram repeated from slide 15]

Challenge: Velocity

Slide19

Challenge: Velocity

Slide20

Challenge: Velocity

Slide21

[Workflow diagram repeated from slide 15]

Challenge: Variety

Slide22

Challenge: Variety

Slide23

Ultra-large-scale Software Repository Mining: The Boa Experience

[ICSE'14]

[ICSE'13] [GPCE'13] [SPLASH'13 SRC] [TOSEM] (under review)

Slide24

Boa's Architecture

Boa's Data Infrastructure: Replicate and Transform -> cache -> Stored on cluster

Boa's Computing Infrastructure: User submits query -> Compiled into Hadoop program -> Deployed and executed on cluster -> Query result returned via web

Slide25

[Workflow diagram repeated from slide 15]

Challenge: Volume

Challenge: Velocity

Challenge: Variety

Slide26

A better solution...

Tests: output sum[timestamp] of int;
cur_time: timestamp;
visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});

Automatically parallelized

Analyzes 18 billion AST nodes in minutes

Only 10 lines of code

No external libraries

Slide27

How has unit testing evolved over time?

Tests: output sum[timestamp] of int;

Slide28

How has unit testing evolved over time?

Tests: output sum[timestamp] of int;
visit(input, visitor {
});

Slide29

How has unit testing evolved over time?

Tests: output sum[timestamp] of int;
visit(input, visitor {
    before n: Modifier ->
});

Slide30

How has unit testing evolved over time?

Tests: output sum[timestamp] of int;
visit(input, visitor {
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
});

Slide31

How has unit testing evolved over time?

Tests: output sum[timestamp] of int;
visit(input, visitor {
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
});

Slide32

How has unit testing evolved over time?

Tests: output sum[timestamp] of int;
visit(input, visitor {
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});

Slide33

How has unit testing evolved over time?

Tests: output sum[timestamp] of int;
cur_time: timestamp;
visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});

Slide34

Tests: output sum[timestamp] of int;
cur_time: timestamp;
visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});

Dataset:

input = project 1

input = project 2

input = project 3

...

input = project n

Boa Program (one copy per input)

Output (Tests):

Tests[631152000] = 5

Tests[631154020] = 12

Tests[631161103] = 14

Tests[631172392] = 18

...
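The flow on this slide, one copy of the program per project with every emitted value summed per timestamp, can be mimicked in a few lines of Python. This is only an analogy for the execution model, not how Boa is implemented; the helper names `run_on_project` and `sum_aggregator` and the two tiny projects are assumptions made up for the example.

```python
import re
from collections import defaultdict

# Same pattern as in the Boa program: matches "Test" or "org.junit.Test".
TEST_ANNOTATION = re.compile(r"^(org\.junit\.)?Test$")


def run_on_project(project):
    """Mimic one copy of the program: emit (commit_time, 1) per test annotation."""
    emitted = []
    for commit_time, annotation_names in project:
        for name in annotation_names:
            if TEST_ANNOTATION.match(name):
                emitted.append((commit_time, 1))
    return emitted


def sum_aggregator(emitted_pairs):
    """Mimic the sum output variable: add up all values sharing an index."""
    totals = defaultdict(int)
    for key, value in emitted_pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical dataset: each project is a list of
# (commit timestamp, annotation names seen in that revision).
projects = [
    [(631152000, ["Test", "Override"]), (631154020, ["Test"])],
    [(631152000, ["org.junit.Test"]), (631154020, ["Test", "Test"])],
]

# Every project is processed independently; only the emitted pairs are combined.
all_emitted = [pair for project in projects for pair in run_on_project(project)]
tests = sum_aggregator(all_emitted)
print(tests)
```

Because each project is handled without looking at any other project, the per-project step can run in parallel, and only the final summation needs to see values from all of them.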

Tests[631152000] << 1;

631152000, 1

Tests[631154020] << 1;

631152000, 1

631154020, 1

631152000, 1

631154020, 1

631154020, 1

631161103, 1

Slide35

Automatic Parallelization

Tests: output sum[timestamp] of int;
cur_time: timestamp;
visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});

Output variables with built-in aggregator functions: sum, mean, top(k), bottom(k), set, collection, etc.
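To make the idea of an aggregator concrete, here is a rough Python sketch of what four of the listed functions might compute over emitted values. These are plain functions written for illustration, not Boa's implementation; in particular, treating top(k) as ranking values by an accumulated weight is an assumption of this sketch, and bottom(k) and collection are left out.

```python
from collections import Counter


def agg_sum(values):
    """sum: add up every emitted value."""
    return sum(values)


def agg_mean(values):
    """mean: arithmetic average of the emitted values."""
    values = list(values)
    return sum(values) / len(values)


def agg_top(k, weighted_values):
    """top(k): the k values with the largest accumulated weight (assumed semantics)."""
    totals = Counter()
    for value, weight in weighted_values:
        totals[value] += weight
    return [value for value, _ in totals.most_common(k)]


def agg_set(values):
    """set: the distinct emitted values."""
    return set(values)


# Hypothetical emitted values, invented for this example.
results = {
    "sum": agg_sum([1, 1, 1, 1]),
    "mean": agg_mean([10, 20, 60]),
    "top(2)": agg_top(2, [("Java", 3), ("C", 1), ("Java", 2), ("Python", 4)]),
    "set": agg_set(["Test", "Override", "Test"]),
}
print(results)
```

The query only emits values; how they are combined is fixed by the declared aggregator, which is what lets a compiler place the combining step in a reduce phase.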

Compiler generates Hadoop MapReduce code

Slide36

Abstracting MSR with Types

Tests: output sum[timestamp] of int;
cur_time: timestamp;
visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});

Custom domain-specific types for mining software repositories

5 base types and 9 types for source code

No need to understand multiple data formats or APIs

Slide37

Abstracting MSR with Types
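As a rough picture of the containment chain on this slide (Project, CodeRepository, Revision, ChangedFile, ASTRoot), the Python dataclasses below model each level as holding a list of the next. This is an illustrative sketch, not Boa's type definitions; the field names `code_repositories`, `revisions`, `files`, and `ast`, and the helper `count_parsed_files`, are assumptions, and the optional `ast` field reflects reading the diagram's 0..1 as "a changed file may have no parsed AST".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ASTRoot:
    namespaces: List[str] = field(default_factory=list)


@dataclass
class ChangedFile:
    name: str
    ast: Optional[ASTRoot] = None  # 0..1: not every file has a parsed AST


@dataclass
class Revision:
    commit_date: int
    files: List[ChangedFile] = field(default_factory=list)


@dataclass
class CodeRepository:
    revisions: List[Revision] = field(default_factory=list)


@dataclass
class Project:
    name: str
    code_repositories: List[CodeRepository]


def count_parsed_files(project):
    """Walk the whole chain and count changed files that carry an AST."""
    return sum(
        1
        for repo in project.code_repositories
        for revision in repo.revisions
        for changed in revision.files
        if changed.ast is not None
    )


# Hypothetical project with one repository and one revision.
example = Project(
    name="demo",
    code_repositories=[
        CodeRepository(
            revisions=[
                Revision(
                    commit_date=631152000,
                    files=[ChangedFile("A.java", ASTRoot()), ChangedFile("README")],
                )
            ]
        )
    ],
)
print(count_parsed_files(example))
```

With one uniform set of types like this, a query does not need to know which version control system or file format a project originally used.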

Project (1) -- (1..*) CodeRepository

CodeRepository (1) -- (*) Revision

Revision (1) -- (*) ChangedFile

ChangedFile (1) -- (0..1) ASTRoot

Slide38

Abstracting MSR with Types

ASTRoot (1) -- (*) Namespace

Namespace (1) -- (1..*) Declaration

Method, Variable, Type: each in a 1 -- * association

Statement, Expression: each in a 1 -- * association

Slide39

Challenge: How can we make mining source code easier?

Answer: Declarative Visitors

Slide40

Background: Visitor Pattern
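As background, the visitor pattern moves operations such as draw and scale out of the shape classes and into separate visitor objects, so each shape only needs an accept method. The Python sketch below illustrates this with the class names from the slide. Python has no method overloading, so the per-type method names (`visit_rectangle` and so on), the constructor arguments, and the string output of `DrawVisitor` are assumptions made for this example.

```python
class Rectangle:
    def __init__(self, width, height):
        self.width, self.height = width, height

    def accept(self, visitor):
        return visitor.visit_rectangle(self)


class Circle:
    def __init__(self, radius):
        self.radius = radius

    def accept(self, visitor):
        return visitor.visit_circle(self)


class Triangle:
    def __init__(self, base, height):
        self.base, self.height = base, height

    def accept(self, visitor):
        return visitor.visit_triangle(self)


class DrawVisitor:
    """Stands in for drawing: returns a text description of each shape."""

    def visit_rectangle(self, r):
        return f"rectangle {r.width}x{r.height}"

    def visit_circle(self, c):
        return f"circle r={c.radius}"

    def visit_triangle(self, t):
        return f"triangle base={t.base} height={t.height}"


class ScaleVisitor:
    """Scales each shape in place by factors x and y."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def visit_rectangle(self, r):
        r.width *= self.x
        r.height *= self.y

    def visit_circle(self, c):
        c.radius *= self.x  # assumption: a circle is scaled by x only

    def visit_triangle(self, t):
        t.base *= self.x
        t.height *= self.y


shapes = [Rectangle(2, 3), Circle(1), Triangle(4, 5)]
for shape in shapes:
    shape.accept(ScaleVisitor(2, 2))
descriptions = [shape.accept(DrawVisitor()) for shape in shapes]
print(descriptions)
```

Adding a new operation now means writing one new visitor class rather than editing every shape.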

Without visitors, each shape implements every operation:

Rectangle: draw(Graphics g), scale(int x, int y)

Triangle: draw(Graphics g), scale(int x, int y)

Circle: draw(Graphics g), scale(int x, int y)

With visitors, each shape only accepts a visitor:

Rectangle: accept(Visitor v)

Triangle: accept(Visitor v)

Circle: accept(Visitor v)

DrawVisitor: visit(Rectangle r), visit(Circle c), visit(Triangle t)

ScaleVisitor: visit(Rectangle r), visit(Circle c), visit(Triangle t)

Slide41

Easing Source Code Mining with Visitors

id := visitor {
    before T -> statement;
    after T -> statement;
};
visit(node, id);

Slide42

Easing Source Code Mining with Visitors

id := visitor {
    before id : T1 -> statement;
    before T2, T3 -> statement;
    before _ -> statement;
};

Slide43

Easing Source Code Mining with Visitors

[Diagram: traversal over the AST types ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression]

Slide44

Easing Source Code Mining with Visitors

[Diagram: AST types in two groups: Method, Type, Statement, Expression; and ASTRoot, Namespace, Declaration, Variable]

before n: Declaration -> {
}

before n: Declaration -> {
    foreach (i: int; n.fields[i])
        visit(n.fields[i]);
}

before n: Declaration -> {
    foreach (i: int; n.fields[i])
        visit(n.fields[i]);
    stop;
}

Slide45

Let's see it in action!

http://boa.cs.iastate.edu/boa/

Slide46

Summary

Ultra-large-scale software repository mining poses several challenges

Boa provides abstractions to address these challenges

Ultra-large-scale dataset with almost 700k projects

Automatically parallelizes queries

Domain-specific language, types, and functions to make mining software repositories easier

Slide47

Boa's Global Impact

90+ users from over 20 countries!

Slide48

Thank you!

http://boa.cs.iastate.edu/