
The Role of Metadata in Machine - PowerPoint Presentation

Uploaded On 2015-11-28




Presentation Transcript

Slide1

The Role of Metadata in Machine Learning for TAR

Amanda Jones, Marzieh Bazrafshan, Fernando Delgado, Tania Lihatsh, Tami Schuyler

ajones@h5.com, mbazrafshan@h5.com, fdelgado@h5.com, tlihatsh@h5.com, tschuyler@h5.com

Slide2

Metadata Use in TAR – Lack of Consensus

It is generally agreed across the industry that metadata is a critical component of ESI for eDiscovery.

Some view incorporation of metadata into machine learning algorithm development for TAR as a matter of course.

Others view it as atypical of, if not incompatible with, machine learning approaches to document classification.

Slide3

Metadata in TAR – Goals of the Study

If metadata provides information that is vital for manual document review in eDiscovery, why would it be any less valuable for TAR?

Goals of the current study:

Establish the potential benefit of incorporating metadata into TAR algorithm development processes.

Establish the potential benefit of leveraging widely inclusive sets of metadata, as opposed to limited, predetermined sets.

Establish the potential benefit of integrating metadata using techniques that preserve the added layer of information associated with metadata values.

Slide4

Metadata in TAR – Data & Methods

3 Distinct Data Sets:

1 drawn from Topic 301 of the TREC 2010 Interactive Task

2 proprietary business data sets

Random sample of 4,500 individually labeled documents for each

Split into a 3,000-document Control Set and a 1,500-document Training Set
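The slides do not say how each 4,500-document sample was divided, so the following is only a minimal sketch that assumes a seeded random shuffle. The function name `split_sample` and the document IDs are hypothetical.

```python
import random


def split_sample(doc_ids, control_size=3000, training_size=1500, seed=0):
    """Shuffle a labeled sample and split it into control and training sets.

    Assumption: the split is random; the slides only give the set sizes.
    """
    if len(doc_ids) != control_size + training_size:
        raise ValueError("sample size must equal control_size + training_size")
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    control = shuffled[:control_size]
    training = shuffled[control_size:]
    return control, training


# Hypothetical IDs standing in for one 4,500-document labeled sample.
sample_ids = [f"DOC-{i:05d}" for i in range(4500)]
control_set, training_set = split_sample(sample_ids)
```

Fixing the seed keeps the Control Set identical across model variants, which matters when the variants are compared against one another.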

All machine learning models were developed using an open-source Support Vector Machine (SVM) implementation

Performance Metric: Area Under the Receiver Operating Characteristic Curve (AUROC)

Slide5

Metadata in TAR – Metadata Choices

Metadata availability varied across data sets

Fields were chosen opportunistically based on availability and amenability to feature transformation

Fields that were populated for fewer than 5% of the documents were omitted
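The 5% rule above could be applied along the following lines. This is a sketch under assumptions: documents are represented as dictionaries, a field counts as unpopulated when its value is `None`, an empty string, or an empty list, and the field names are hypothetical.

```python
def populated_fields(documents, min_coverage=0.05):
    """Return the field names populated in at least min_coverage of documents."""
    counts = {}
    for doc in documents:
        for field, value in doc.items():
            # Assumption: None, "" and [] all mean "not populated".
            if value not in (None, "", []):
                counts[field] = counts.get(field, 0) + 1
    threshold = min_coverage * len(documents)
    return sorted(field for field, n in counts.items() if n >= threshold)


# Hypothetical collection: author is always present, bates_prefix appears
# in 4 of 100 documents, cc appears in 10 of 100.
documents = [
    {
        "author": "A. Author",
        "bates_prefix": "ABC" if i < 4 else "",
        "cc": "x@example.com" if i < 10 else None,
    }
    for i in range(100)
]
kept = populated_fields(documents)
```

Here `bates_prefix` falls below the 5% threshold and is dropped, while `author` and `cc` are kept.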

Continuous metadata values were transformed into categorical values. For example, date values were collapsed into simple Month-Year values; file size values were assigned to categories ranging from very small to very large.

Slide6

Metadata in TAR – Metadata Choices

Standard Metadata:

Author
Sender, Recipient, Cc
Subject, Title, File Name
Document Type, File Extension
Sent Date, Created Date
Sender Domain, Recipient Domain

Extended Metadata (all of the above, plus):

All Custodians, Primary Custodian
Record Type
Attachment Name
Bates Prefix
Drop Id
Company/Organization
Native File Size, Text Size
Normalized Date, Parent Date
Family Count, Attachment Count
Recipient Count, Cc Count, Combined Recipient Count
Page Count

Slide7

Metadata in TAR – Incremental Testing

Hypothesis 1: Incorporating metadata into the machine learning process will lead to improved model performance.

Text from Standard Metadata added to the body text of documents

There was a general trend of improvement across the three data sets. The improvement was highly significant for Data Set 3.
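One plausible reading of "added to the body text" is simple concatenation before feature extraction. The sketch below assumes that reading; the dictionary keys and the `body_with_metadata` helper are hypothetical and are not taken from the study.

```python
# Hypothetical keys for the Standard Metadata fields listed in the slides.
STANDARD_FIELDS = [
    "author", "sender", "recipient", "cc", "subject", "title", "file_name",
    "document_type", "file_extension", "sent_date", "created_date",
    "sender_domain", "recipient_domain",
]


def body_with_metadata(doc, fields):
    """Append metadata values to the body as plain, unlabeled text."""
    parts = [doc.get("body", "")]
    for field in fields:
        value = doc.get(field)
        if not value:
            continue  # skip fields that are missing or empty
        if isinstance(value, (list, tuple)):
            parts.extend(str(v) for v in value)
        else:
            parts.append(str(value))
    return " ".join(p for p in parts if p)


doc = {
    "body": "Please review the attached contract.",
    "author": "J. Smith",
    "subject": "Contract draft",
    "recipient": ["a@example.com", "b@example.com"],
}
augmented = body_with_metadata(doc, STANDARD_FIELDS)
```

With this approach a word from the Subject field is indistinguishable from the same word in the body, because the field of origin is not recorded.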

Hypothesis 2: Incorporating the text from Extended Metadata will lead to superior results as compared to incorporating Standard Metadata alone.

Text from Extended Metadata added to the body text of documents – compared to models based on addition of Standard Metadata

There was a general trend of improvement across the three data sets. The improvement was highly significant for Data Sets 1 and 3.

Slide8

Metadata in TAR – Incremental Testing

Hypothesis 3: Using metadata values in ways that preserve both attribute and value information will result in superior performance.

Extended Metadata values prefixed to indicate their field origins added to body text – compared to models with Extended Metadata added as plain text

Improvements varied across the three data sets but were significant for Data Set 2.
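A minimal sketch of field-prefixed metadata follows, combined with the Month-Year and file-size categories described earlier in the deck. The token format (`FIELD_value`), the size cut points, and all names are assumptions made for illustration; the slides do not specify them.

```python
import re
from datetime import date

# Hypothetical cut points in bytes; the slides only say "very small to very large".
SIZE_BOUNDS = [
    (10_000, "very_small"),
    (100_000, "small"),
    (1_000_000, "medium"),
    (10_000_000, "large"),
]


def month_year(d):
    """Collapse a date into a Month-Year category, e.g. 2001-05."""
    return d.strftime("%Y-%m")


def size_category(num_bytes):
    """Map a file size onto one of five ordered categories."""
    for upper, label in SIZE_BOUNDS:
        if num_bytes < upper:
            return label
    return "very_large"


def prefixed_tokens(field, value, split=True):
    """Turn a metadata value into tokens that carry their field of origin."""
    prefix = re.sub(r"[^A-Za-z0-9]+", "", field).upper()
    if split:
        # Free-text fields: one prefixed token per word.
        parts = re.findall(r"\w+", str(value).lower())
    else:
        # Categorical values: keep the whole value as a single token.
        parts = [re.sub(r"[^A-Za-z0-9]+", "_", str(value)).strip("_")]
    return [f"{prefix}_{p}" for p in parts if p]


doc = {
    "body": "Please review the attached contract.",
    "subject": "Contract draft",
    "sent_date": date(2001, 5, 14),
    "native_file_size": 48_213,
}
tokens = (
    prefixed_tokens("subject", doc["subject"])
    + prefixed_tokens("sent_date", month_year(doc["sent_date"]), split=False)
    + prefixed_tokens("native_file_size", size_category(doc["native_file_size"]), split=False)
)
prefixed_text = doc["body"] + " " + " ".join(tokens)
```

Under this scheme "contract" in the Subject field becomes the feature `SUBJECT_contract`, distinct from "contract" in the body, so both the attribute and the value reach the model.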

Dual modeling – prefixed metadata values and simple body text modeled independently, scores from two models multiplied to arrive at a final score – dual models compared to single models with prefixed Extended Metadata
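The score combination and its evaluation might be sketched as below. Two assumptions are made that the slides do not confirm: each model emits a score between 0 and 1 (multiplying raw SVM margins, which can be negative, would not behave sensibly), and AUROC is computed by pairwise comparison. All scores and labels are invented for illustration.

```python
def combine_scores(metadata_scores, text_scores):
    """Multiply per-document scores from the two independent models."""
    if len(metadata_scores) != len(text_scores):
        raise ValueError("both models must score the same documents")
    return [m * t for m, t in zip(metadata_scores, text_scores)]


def auroc(labels, scores):
    """AUROC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Pairwise comparison is quadratic, which is
    fine for a sketch but slow for large control sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented labels and scores for six control-set documents.
labels = [1, 1, 0, 0, 1, 0]
metadata_scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.6]
text_scores = [0.8, 0.7, 0.3, 0.4, 0.6, 0.5]
final_scores = combine_scores(metadata_scores, text_scores)
dual_auroc = auroc(labels, final_scores)
```

Multiplication rewards documents that both models rate highly and penalizes a document that either model rates low, which is one reason to keep the two sources of evidence separate until the final step.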

There was a general trend of improvement across the three data sets. The improvement was highly significant for Data Sets 2 and 3.

Slide9

Overall Findings – MD Can Improve TAR

Stepping back from incremental pairwise comparisons – clearer answers and more striking differences emerge.

Models incorporating Extended Metadata significantly outperformed models based on body text alone in each condition for every data set.

Slide10

Overall Findings – More MD Is Better

Similarly strong trends can be observed when each model created using Standard Metadata is compared to its Extended Metadata counterpart.

Extended Metadata improvements were highly significant in all cases for Data Sets 1 and 3 and significant for the dual model in Data Set 2.

Slide11

Metadata in TAR – Conclusions

Incorporating metadata as an integral component of machine learning processes for TAR in eDiscovery will benefit the community of practice.

Neglecting this resource is – at best – a missed opportunity. In an information retrieval effort, why leave information on the table?

To realize the full potential of using metadata in machine learning for TAR, practitioners should not rely solely on a limited intuitive set of metadata.

Examining the contributions of specific metadata fields at a more granular level could be very worthwhile.

Is “all available” always the best choice?

There are still more questions than answers when it comes to the use of metadata in modeling for TAR.

More effective algorithms?

Better techniques for capturing the full metadata contribution?

Slide12

Questions?