Uploaded On 2016-11-23



Presentation Transcript

Slide1

Nectar: Efficient Management of Computation and Data in Data Centers

Lenin Ravindranath, Pradeep Kumar Gunda, Chandu Thekkath, Yuan Yu, Li Zhuang

Slide2

Motivation

Resources are poorly managed in a data center

Computation

Storage

Redundant computations

Wasting resources

Manually managed

Unused files occupying space

Redundant output files

Slide3

Goal

Efficiently manage resources in a cluster

Computation

Storage

Nectar

Slide4

Key Insight

Data Center

Computation

Storage

Single query interface for computation and data access

DryadLINQ

Query Interface

User

Slide5

Goal

Efficiently manage resources in a cluster

Computation

Storage

Nectar

Slide6

Computation

PROBLEM: Redundant Computation

Programs share sub queries

Programs share partial data sets

SOLUTION: Caching

Cache results of popular sub queries

Automatically rewrite user query to use cache

X.Select(…)

X.Select(…).Where(…)

X.Select(…)

(X+X’).Select(…)
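A minimal sketch of the caching idea above, written in Python rather than DryadLINQ. The names `fingerprint`, `ResultCache`, and `select_with_cache` are hypothetical and not part of Nectar; the sketch assumes X’ is appended to X and that results are keyed by a hash of the query name and its input.

```python
import hashlib
from typing import Callable, Dict, List, Optional


def fingerprint(query_name: str, data: List[int]) -> str:
    """Hypothetical fingerprint over the query name and its input data."""
    h = hashlib.sha256()
    h.update(query_name.encode("utf-8"))
    h.update(repr(data).encode("utf-8"))
    return h.hexdigest()


class ResultCache:
    """In-memory stand-in for a cache of sub query results."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[int]] = {}

    def get(self, key: str) -> Optional[List[int]]:
        return self._entries.get(key)

    def put(self, key: str, result: List[int]) -> None:
        self._entries[key] = result


def select_with_cache(
    query_name: str,
    fn: Callable[[int], int],
    x: List[int],
    x_new: List[int],
    cache: ResultCache,
) -> List[int]:
    """Evaluate (X + X').Select(fn), reusing a cached X.Select(fn) if present."""
    key = fingerprint(query_name, x)
    cached = cache.get(key)
    if cached is None:
        # Cache miss: compute the result over X and remember it.
        cached = [fn(v) for v in x]
        cache.put(key, cached)
    # Only the new records X' are processed; the rest comes from the cache.
    fresh = [fn(v) for v in x_new]
    result = cached + fresh
    # Cache the combined result so a later query over X + X' can reuse it.
    cache.put(fingerprint(query_name, x + x_new), result)
    return result
```

Calling `select_with_cache` first with an empty `x_new` and then with new records applies `fn` only to the new records on the second call.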

[Figure: two overlapping inputs, items 1 to 7 and items 2 to 8]

Slide7

Does caching help?

Analyzed logs from production clusters

Logs of 3 months (Oct – Dec 2008)

33 virtual clusters, 36000 jobs

Parsed SCOPE programs, extracted sub queries

Simulated caching

Slide8

Caching helps

About 50% cache hit on 10 clusters

More than 30% cache hit on 20 clusters

35% on average

Slide9

Goal

Efficiently manage resources in a cluster

Computation

Storage

Nectar

Slide10

Storage

PROBLEM: Manually managed

Unused files occupying space

50% of the data was never accessed in the last 275 days

Slide11

Storage

SOLUTION: Automatically manage data

Track usage and delete infrequently used files

Store the programs that recompute the data

Slide12

Query Interface

Data Center

Computation

Storage

DryadLINQ

Query Interface

User

Slide13

Goal

Efficiently manage resources in a cluster

Computation

Storage

Nectar

Slide14

Nectar

Data Center

Computation

Storage

DryadLINQ

Query Interface

Nectar

User

Slide15

Nectar Architecture

Query Rewriter

DryadLINQ

Dryad

DryadLINQ program

Query

Cache entries

Nectar Client

Cache Server

Add T to cache

P

P’

Add R to cache

R

T

Cluster

Slide16

Nectar Architecture

Query Rewriter

Nectar Client

Cache Server

Slide17

Query Rewriter

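[Figure: query rewrite example. Select over X yields the cached result R; Select over X’ is combined with R by Concat to give (R+R’). Labels: Select, X, X’, R, Concat, (R+R’), Cache]

Slide18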

Query Rewriter

[Figure: query rewrite example with Order by; Merge Sort takes the place of Concat. Labels: Select, Order by, X, X’, R, Merge Sort, (R+R’), Cache]

Slide19

Query Rewriter

Generates multiple plans

Using multiple cache entries

Selects the best plan

Based on benefit

Execution time

Output Size

Whether pipeline is broken

Operators supported

Select, Where, Order by, Group by, Join

X.Select(…)

X.Select(…).Where(…)

Slide20

Nectar Architecture

Query Rewriter

Nectar Client

Cache Server

Slide21

Cache Server

SQL Server

Garbage Collector

Cache Policy

Cache Server

URI

Query Fingerprint

Query + Data

Fingerprint

Execution Time

Output Size

Inquire Stats

Usage Stats

Fingerprints

Slide22

Cache policy

Insertion Policy

Always add program output to cache

Sub query outputs are added to cache

Popularity exceeds a threshold

Savings exceeds a threshold

Slide23

Garbage Collector

Storage pressure

Delete infrequently used files

Deletion policy

Based on savings

Cache type

Mark and sweep algorithm

Delete cache entry

Reachability analysis

Delete files
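The collection steps listed above (delete cache entries, run a reachability analysis, delete unreachable files) can be sketched in Python as follows. The `savings` score, the `min_savings` threshold, and all names are assumptions for illustration; the slides do not give the actual policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class CacheEntry:
    fingerprint: str
    file_uri: str      # file in the distributed FS that holds the result
    savings: float     # hypothetical benefit score for keeping the entry


def collect_garbage(
    entries: Dict[str, CacheEntry],
    stored_files: Set[str],
    min_savings: float,
) -> Tuple[List[str], List[str]]:
    """Evict low-value cache entries, then delete files nothing points to.

    Mutates `entries` and `stored_files`; returns the evicted fingerprints
    and the deleted file names.
    """
    # Step 1: delete cache entries whose savings fall below the threshold.
    evicted = [fp for fp, e in entries.items() if e.savings < min_savings]
    for fp in evicted:
        del entries[fp]

    # Step 2 (mark): a file is reachable if a surviving entry refers to it.
    reachable = {e.file_uri for e in entries.values()}

    # Step 3 (sweep): delete every stored file that was not marked.
    deleted = sorted(stored_files - reachable)
    stored_files.difference_update(deleted)
    return evicted, deleted
```

A file shared by two entries survives as long as one of those entries is kept, which is why the file deletion is driven by reachability rather than by the eviction list.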

[Figure: Cache Server with entries 1, 2, 3; Distributed FS with files 1, 2]

Slide24

What if I try to access a garbage collected file?

Slide25

Nectar Architecture

Query Rewriter

Nectar Client

Cache Server

Program store

Slide26

Program Store

Store executed programs in the cluster

Output file is tied to its corresponding program that generates the output

If a file is deleted, the program is executed to regenerate the output

Slide27

Managing Data
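The program store behaviour from the previous slide (each output file is tied to the program that produced it, and a deleted file is regenerated by re-running that program) might look like the following Python sketch. `DerivedDataStore` and its dictionaries are hypothetical stand-ins for the program store, cache server, and distributed file system; the generated file names are arbitrary.

```python
from typing import Callable, Dict, List

Program = Callable[[], List[int]]


class DerivedDataStore:
    """Toy model: outputs are addressed by fingerprint, not by file name."""

    def __init__(self) -> None:
        self.files: Dict[str, List[int]] = {}    # stand-in for the distributed FS
        self.programs: Dict[str, Program] = {}   # program store: fingerprint -> program
        self.locations: Dict[str, str] = {}      # fingerprint -> current file name
        self._counter = 0

    def _materialize(self, fp: str) -> str:
        # Run the stored program and write its output under a fresh name.
        self._counter += 1
        name = f"out-{self._counter}.pt"
        self.files[name] = self.programs[fp]()
        self.locations[fp] = name
        return name

    def write(self, fp: str, program: Program) -> str:
        """Record the program for this fingerprint and produce its output."""
        self.programs[fp] = program
        return self._materialize(fp)

    def read(self, fp: str) -> List[int]:
        """Return the output, regenerating it if the file was deleted."""
        name = self.locations.get(fp)
        if name is None or name not in self.files:
            name = self._materialize(fp)
        return self.files[name]
```

Reading after the file has been deleted re-runs the stored program and records the new file name, so the caller never sees the deletion.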

[Figure: ToPartitionedTable(lenin\foo.pt). Labels: Nectar Client, Program Store, Distributed FS, Cache Server, DryadLINQ, Dryad, Nectar, usr, foo.pt, A31E4.pt, FP, Program, P, P’]

Slide28

Managing Data

[Figure: FromPartitionedTable(lenin\foo.pt). Labels: Nectar Client, Program Store, Distributed FS, Cache Server, DryadLINQ, Dryad, Nectar, usr, foo.pt, A31E4.pt, FP, Program, P]

Slide29

Managing Data

[Figure: FromPartitionedTable(lenin\foo.pt). Labels: Nectar Client, Program Store, Distributed FS, Cache Server, DryadLINQ, Dryad, Nectar, usr, foo.pt, A31E4.pt, KJ1LM.pt, FP, Program, P]

Slide30

Goal

Efficiently manage resources in a cluster

Computation

Storage

Nectar

Computation

Storage

Unified computation and data

Slide31

Distributed cache servers
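One way a client could pick a cache server when entries are partitioned by query fingerprint is to hash the fingerprint and take it modulo the number of servers. This is a sketch under that assumption; `server_for` and the server names are hypothetical, and the slides do not say which hash function Nectar uses.

```python
import hashlib
from typing import List


def server_for(query_fingerprint: str, servers: List[str]) -> str:
    """Map a query fingerprint to one cache server, deterministically."""
    if not servers:
        raise ValueError("at least one cache server is required")
    # Use a stable hash so every client routes the same fingerprint
    # to the same server (Python's built-in hash() is salted per process).
    digest = hashlib.sha256(query_fingerprint.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

With plain modulo routing, changing the number of servers remaps most fingerprints; a scheme such as consistent hashing would reduce that, at the cost of more bookkeeping.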

[Figure: a Nectar Client reaches the Cache Servers through a hash based on the query fingerprint. Each Cache Server is shown with a SQL Server, partitioned by query fingerprint, and a Program store. The Garbage collector is centralized.]

Slide32

Summary

We built Nectar

Automatically manage data

Efficiently manage computation

Components

Query Rewriter

Automatically rewrite queries to use cache

Cache server

Popular sub queries are cached

Garbage collected based on usage

Program store

Store programs which regenerate the output

Slide33

Status

Almost done with development

Query Rewriter

Including other operators

Fingerprinter

Program static analysis

Cache Server

Program Store

In the process of deploying

Slide34

Can we do better?

Slide35

Cluster Utilization

Most clusters have more than 40% idle time

Even the busiest clusters have 10-20% idle time

Slide36

Exploiting idle time

Do speculative caching

Cache popular data before query issued

Run program on new streams when available

No side effects

Executed only when cluster is idle

Low priority jobs

Output garbage collected with high priority

More electric bill? Not really!

Slide37

Questions

Slide38

Backup

Slide39

Caching Results