Presentation Transcript

Slide1

Cloud Storage Security

Murat Kantarcioglu

Slide2

Main Cloud Security Problems

VM-level attacks

Exploit potential vulnerabilities in hypervisor

Cloud provider vulnerabilities

E.g. cross-site scripting vulnerability in Salesforce.com

Phishing

Integrating cloud authentication with company authentication mechanisms

Availability

Single point of Failure

Assurance of computational integrity by cloud provider

Slide3

Issues with moving data to cloud providers

Will the cloud provider fight against a subpoena?

Do you trust Azure logs to show gross negligence on Microsoft's part?

Contractual obligations?

If you wanted to hack one place for espionage, Gmail could be a good starting point.

Data lock-in

Slide4

What is new in cloud computing security?

Too big to fail?

What if Amazon hardware is confiscated?

What if Amazon fails?

Hiding activity patterns

Using cloud for crime?

Secure cloud auditing

Mutual auditability

Snowden Affairs

Slide5

Cloud Computing

Like the Software-as-a-Service and DAS models, it offers many advantages

Better availability

Reduced Costs

Unlimited scalability and elasticity


Cloud Computing

Database

App Server

Code

Email

Multimedia

Slide6

Hybrid Cloud

Integrates local infrastructure with public cloud resources

Extra Advantages

The flexibility of shifting workload to the public cloud when the private cloud is overwhelmed (cloud bursting)

Utilizing in-house resources along with public resources

Cons

Sensitive data exposure

Public cloud resource allocation cost (both storage and computing)


Public / External

Private / Internal

Slide7

Constraints

Data & Computation Partitioning Challenge

s_id  name     ssn   dept
1     James    1234  CS
2     Charlie  4321  EE
3     John     5645  CS
4     Matt     8743  ECON

Q1: SELECT name, ssn FROM Student

Q2: SELECT dept, count(*) FROM Student GROUP BY dept

Student

Q1 contains sensitive information

Q2 execution is more expensive

Sensitive

How to partition the table?

How to split computation?

Slide8

Our Hybrid Cloud Architecture

[Architecture diagram: a User Interface Layer accepts relations R, queries Q, and constraints C; a Statistics Gathering Layer and a Data and Query Management Layer split the workload, sending (R, Q_priv) to the private Hive / Hadoop HDFS cluster and (R_pub, Q_pub) to the public Hive / Hadoop HDFS cluster, and returning results for Q_priv and Q_pub.]

Slide9

Design Spectrum

Data Model

Relational, Semi-structured, Key-Value Stores, Text

Sensitivity Model

Attribute Level, Privacy Associations, View-Based

Partitioning Models

Workload Partitioning, Intra-query Parallelism, Dynamic Workload

Minimization Priority

Running Time, Sensitive Data Disclosure, Monetary Cost

Slide10

Outline of Solution

Notation

Formulate Computation Partition Problem (CPP)

Solution to CPP

Experimental Results

Slide11

Notation

sens(R'): The estimated number of sensitive cells in dataset R'

baseTables(q): The estimated minimum set of data items necessary to answer query q Є Q

runT_x(q): The estimated running time of query q Є Q at site x (either public or private)

ORunT(Q', Q''): The overall execution time of queries in Q', given that queries in Q'' are executed on the public cloud

Slide12

Detailed Hybrid Cloud Architecture

[Architecture diagram: the Data and Query Management Layer contains a Computation Partitioning Module supported by a Monetary Cost Estimator and a Disclosure Risk Estimator; the Statistics Gathering Layer supplies runT_x(q) and baseTables(q); given relations R, queries Q, and constraints C, the system sends (R, Q_priv) to the private Hive / Hadoop HDFS cluster and (R_pub, Q_pub) to the public Hive / Hadoop HDFS cluster.]

Slide13

Computation Partitioning Problem (CPP)

Find a subset Q_pub of the given query workload Q, and a subset R_pub of the given dataset R, that minimize the overall execution time, where MC (monetary cost) and DC (sensitive data disclosure) are user-defined constraints
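Using the notation slide earlier and the metrics defined on the next slide, CPP can be read as the following optimization. This is only a sketch of one plausible formalization, not necessarily the authors' exact constraint form:

\min_{Q_{pub} \subseteq Q,\ R_{pub} \subseteq R} \; ORunT(Q, Q_{pub})
\quad \text{s.t.} \quad
stor(R_{pub}) + \sum_{q \in Q_{pub}} proc(q) \le MC,
\qquad
sens(R_{pub}) \le DC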

Slide14

Metrics in CPP

Query Execution Time (runT_x(q))

Monetary Costs

stor(R_pub): Storage monetary cost of the public cloud partition

proc(q): Processing monetary cost of a public-side query q

Sensitive Data Disclosure Risk (sens(R_pub))

Estimated number of sensitive cells within R_pub

Slide15

Solution to CPP

CPP can be simplified to only finding Q_pub

Dynamic Programming Approach

CPP(Q, MC, DC) = Q_pub

Input: query set Q, monetary constraint MC, disclosure constraint DC; Output: Q_pub

Slide16

Example

If MC < 25 or DC < 20:

CPP({q1, q2, q3}, MC, DC) = CPP({q1, q2}, MC, DC)

q3 can only run on the private side.

Slide17

Example

If q3 can run on both sides

Case 1

CPP({q1, q2, q3}, MC, DC) = CPP({q1, q2}, MC, DC)

What if q3 runs on the private side?

Slide18

Example

Case 2

What if q3 runs on the public side?

CPP(Q, MC, DC) = MIN_TIME(CPP({q1, q2}, j, k) + q3)

where MC-25 ≤ j ≤ MC-15 and DC-20 ≤ k ≤ DC-0

(25 and 15 are the maximum and minimum possible monetary cost of q3; 20 and 0 are the maximum and minimum possible disclosure risk of q3)

Choose the minimum overall running time between Case 1 and Case 2
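For concreteness, the recursion sketched on the last few slides could be memoized roughly as below. This is only an illustrative sketch with hypothetical names: budgets are assumed to be integer-discretized, it returns the minimum time rather than the set Q_pub itself (recovering Q_pub would require recording the choices), and it treats the overall time as additive per query, whereas ORunT models the two clouds running queries concurrently.

import java.util.HashMap;
import java.util.Map;

public class CppSolver {

  // Per-query estimates (hypothetical container, mirroring the notation slide).
  static class Query {
    final boolean canRunPublic;   // some queries may only run on the private side
    final int pubCost;            // proc(q): monetary cost if run publicly
    final int disclosure;         // sensitive cells exposed if its data moves to the public side
    final long pubTime;           // runT_pub(q)
    final long privTime;          // runT_priv(q)

    Query(boolean canRunPublic, int pubCost, int disclosure, long pubTime, long privTime) {
      this.canRunPublic = canRunPublic;
      this.pubCost = pubCost;
      this.disclosure = disclosure;
      this.pubTime = pubTime;
      this.privTime = privTime;
    }
  }

  private final Query[] queries;
  private final Map<String, Long> memo = new HashMap<>();

  CppSolver(Query[] queries) {
    this.queries = queries;
  }

  // Minimum total running time for queries[0..i] with monetary budget mc and disclosure budget dc left.
  long cpp(int i, int mc, int dc) {
    if (i < 0) {
      return 0L;
    }
    String key = i + ":" + mc + ":" + dc;
    Long cached = memo.get(key);
    if (cached != null) {
      return cached;
    }

    Query q = queries[i];
    // Case 1: q runs on the private side, budgets stay untouched.
    long best = cpp(i - 1, mc, dc) + q.privTime;
    // Case 2: q runs on the public side, if it may and if the remaining budgets allow it.
    if (q.canRunPublic && q.pubCost <= mc && q.disclosure <= dc) {
      best = Math.min(best, cpp(i - 1, mc - q.pubCost, dc - q.disclosure) + q.pubTime);
    }
    memo.put(key, best);
    return best;
  }
}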

Slide19

Experimental Setting

Private Cloud: 14 nodes, located at UTD, Pentium IV, 4 GB RAM, 290-320 GB disk space

Public Cloud: 38 nodes, located at UCI, AMD Dual Core, 8 GB RAM, 631 GB disk space

Hadoop 0.20.2 and Hive 0.7.1

Dataset and Statistic Collection: 100 GB TPC-H data

Query Workload: 40 queries containing modified versions of Q1, Q3, Q6, Q11

Slide20

Experimental Setting

Estimation of Weight (w_x)

Running all 22 TPC-H queries for a 300 GB dataset

w_pub ≈ 40 MB/sec, w_priv ≈ 8 MB/sec

Resource Allocation Cost

Amazon S3 pricing for storage and communication

Storage = $0.140/GB + PUT, Communication = $0.120/GB + GET

PUT = $0.01/1000 requests, GET = $0.01/10000 requests

Amazon EC2 and EMR pricing for processing

$0.085 + $0.015 = $0.1/hour

Sensitivity

Customer: c_name, c_phone, c_address attributes

Lineitem: All attributes in 1%, 5%, and 10% of tuples
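As a back-of-the-envelope illustration of how these prices feed the monetary cost constraint (numbers computed here, not taken from the slides): keeping the 100 GB TPC-H dataset in S3 costs roughly 100 GB × $0.140/GB = $14.00 per month plus PUT requests, shipping it to the public cloud once costs roughly 100 GB × $0.120/GB = $12.00 plus GET requests, and each instance-hour of EC2/EMR processing adds $0.10.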

Slide21

Experimental Results

Slide22

Experimental Results

Slide23

Hadoop - Why ?

Need to process huge datasets on large clusters of computers

Very expensive to build reliability into each application

Nodes fail every day

Failure is expected, rather than exceptional

The number of nodes in a cluster is not constant

Need a common infrastructure

Efficient, reliable, easy to use

Open source, Apache-licensed version of the Google File System

Slide24

Who uses Hadoop?

Amazon/A9

Facebook

Google

It has GFS

New York Times

Veoh

Yahoo!

…. many more

Cloudera

Similar to the Red Hat business model: added services on Hadoop

Slide25

Commodity Hardware

Typically a 2-level architecture

Nodes are commodity PCs

30-40 nodes/rack

Uplink from rack is 3-4 gigabit

Rack-internal is 1 gigabit

Aggregation switch

Rack switch

Slide26

Hadoop Distributed File System (HDFS)

Original Slides by

Dhruba Borthakur

Apache Hadoop Project Management Committee

Slide27

Goals of HDFS

Very Large Distributed File System

10K nodes, 100 million files, 10PB

Yahoo! is working on a version that can scale to even larger amounts of data.

Assumes Commodity Hardware

Files are replicated to handle hardware failure

Detect failures and recover from them

Optimized for Batch Processing

Data locations exposed so that computations can move to where data resides

Remember that moving large amounts of data is an important bottleneck.

Provides very high aggregate bandwidth

Slide28

Distributed File System

Single Namespace for entire cluster

Again this is changing soon!!!

Data Coherency

Write-once-read-many access model

Client can only append to existing files

Files are broken up into blocks

Typically 64MB block size

Each block replicated on multiple DataNodes

Intelligent Client

Client can find location of blocks

Client accesses data directly from DataNode
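A minimal sketch of this access model through the Hadoop FileSystem API (illustrative only: the path is made up, and append support depends on the HDFS version and configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteOnceReadMany {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);        // namespace operations go through the NameNode
    Path p = new Path("/user/demo/events.log");  // hypothetical path

    FSDataOutputStream out = fs.create(p);       // write once
    out.writeBytes("first record\n");
    out.close();

    FSDataOutputStream app = fs.append(p);       // existing files can only be appended to
    app.writeBytes("appended record\n");
    app.close();

    FSDataInputStream in = fs.open(p);           // read many; block data is streamed from DataNodes
    IOUtils.copyBytes(in, System.out, conf, true);
  }
}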

Slide29

HDFS Architecture

Slide30

MapReduce

Original Slides by

Owen O’Malley (Yahoo!)

&

Christophe Bisciglia, Aaron Kimball & Sierra Michells-Slettvet

Slide31

MapReduce - What?

MapReduce is a programming model for efficient distributed computing

It works like a Unix pipeline

cat input | grep | sort | uniq -c | cat > output

Input | Map | Shuffle & Sort | Reduce | Output

Efficiency from

Streaming through data, reducing seeks

Pipelining

A good fit for a lot of applications

Log processing

Web index building

Slide32

MapReduce - Dataflow

Slide33

MapReduce - Features

Fine grained Map and Reduce tasks

Improved load balancing

Faster recovery from failed tasks

Automatic re-execution on failure

In a large cluster, some nodes are always slow or flaky

Framework re-executes failed tasks

Locality optimizations

With large data, bandwidth to data is a problem

Map-Reduce + HDFS is a very effective solution

Map-Reduce queries HDFS for locations of input data

Map tasks are scheduled close to the inputs when possible

Slide34

Word Count Example

Mapper

Input: value: lines of text of input

Output: key: word, value: 1

Reducer

Input: key: word, value: set of counts

Output: key: word, value: sum

Launching program

Defines this job

Submits job to cluster

Slide35

Word Count Dataflow

Slide36

Word Count Mapper

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emits (word, 1) for every token in the input line.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

Slide37

Word Count Reducer

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // Sums the counts emitted for each word and emits (word, total).
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Slide38

Word Count Example

Jobs are controlled by configuring JobConfs

JobConfs are maps from attribute names to string values

The framework defines attributes to control how the job is executed

conf.set("mapred.job.name", "MyApp");

Applications can add arbitrary values to the JobConf

conf.set("my.string", "foo");

conf.setInt("my.integer", 12);

JobConf is available to all tasks
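A small hypothetical sketch of that last point, in the same old mapred API style as the Word Count example above (the property names are the made-up ones from this slide): a task can read the values back in configure().

public static class ConfiguredMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private String myString;
  private int myInteger;

  @Override
  public void configure(JobConf job) {
    // The JobConf assembled by the launching program is handed to every task.
    myString = job.get("my.string", "default");
    myInteger = job.getInt("my.integer", 0);
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Use the configured values however the job needs them; here they are simply emitted.
    output.collect(new Text(myString), new IntWritable(myInteger));
  }
}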

Slide39

Putting it all together

Create a launching program for your application

The launching program configures:

The Mapper and Reducer to use

The output key and value types (input types are inferred from the InputFormat)

The locations for your input and output

The launching program then submits the job and typically waits for it to complete

Slide40

Putting it all together

JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);

Slide41

Input and Output Formats

A Map/Reduce job may specify how its input is to be read by specifying an InputFormat to be used

A Map/Reduce job may specify how its output is to be written by specifying an OutputFormat to be used

These default to TextInputFormat and TextOutputFormat, which process line-based text data

Another common choice is SequenceFileInputFormat and SequenceFileOutputFormat for binary data

These are file-based, but they are not required to be

Slide42

Vigiles: Fine-grained Access Control for MapReduce Systems **

Murat Kantarcioglu

Joint work with Huseyin Ulusoy, Erman Pattuk, Kevin Hamlen

Slide43

Motivation

MapReduce systems have become popular for processing big data (e.g., Hadoop)

Initially, security and privacy were not the primary concerns of MapReduce systems

New challenges due to high volume and high variety

Need to support relational, semi-structured and unstructured data

Slide44

Does NoSQL mean No Security??

Slide45

Motivation

Many security applications are being developed for different needs

Apache Accumulo allows multi-level access control for key-value stores

Apache Knox allows authorization, authentication, and auditing capabilities

Apache Argus tries to achieve "centralized security administration to manage all security related tasks in a central UI"

"Fine"-grained access control for HDFS (file level), Hive (column level) and HBase (column level)

Slide46

Why file level access is not fine-grained enough?

A given HDFS file may contain significant sensitive information

A log file for a web service supporting multiple companies contains billions of records from different users.

Allowing one user to see all the info in a given file makes it easier to abuse and/or misuse the data.

There is no need for an employee working on company Y's account to see company X's logs.

The need for fine-grained access control is already addressed in relational database management systems

Slide47

View-based access control in RDBMSs

Define a view containing the predicates to select the tuples to be returned to a given subject S

Grant S the SELECT privilege on the view, and not on the underlying table

Example:

CREATE VIEW CompanyXLog AS SELECT * FROM Log WHERE Log.Company = X;
GRANT SELECT ON CompanyXLog TO Ann;

Slide48

Our Previous Work: Access control for Hive**

** Thuraisingham et al. Secure data storage and retrieval in the cloud. CollaborateCom 2010: 1-8.

Slide49

Comparison to Apache Sentry

Apache Sentry allows view definition at the column level

Our HiveAC framework allows arbitrary view definition based on any predicate

Developed in late 2009; we did not release an open source version

Slide50

Fine-grained access control for non-relational data

Fine-grained access control is not supported for generic MapReduce computation

Example:

Slide51

Goal

Propose a fine-grained access control mechanism for the generic MapReduce model

Provide a modular architecture that is independent of the underlying MapReduce version

Incur minimal overhead

Have an easy-to-use and expressive security policy specification

Slide52

High Level Idea

Apply a predicate to all generated/read key-value pairs before the map functions process them.
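A minimal sketch of this idea against the old org.apache.hadoop.mapred API (the class and interface names below are hypothetical, and this is not the actual Vigiles implementation, which in-lines the checks via AspectJ as discussed later): wrap the RecordReader so an access control filter sees every key-value pair before the map function does.

import java.io.IOException;
import org.apache.hadoop.mapred.RecordReader;

// Hypothetical ACF contract: return true to pass the pair to the mapper
// (possibly after sanitizing the value in place), false to hide it entirely.
interface AccessControlFilter<K, V> {
  boolean accept(K key, V value);
}

class FilteredRecordReader<K, V> implements RecordReader<K, V> {
  private final RecordReader<K, V> delegate;
  private final AccessControlFilter<K, V> acf;

  FilteredRecordReader(RecordReader<K, V> delegate, AccessControlFilter<K, V> acf) {
    this.delegate = delegate;
    this.acf = acf;
  }

  public boolean next(K key, V value) throws IOException {
    // Skip records the ACF denies; the map function never observes them.
    while (delegate.next(key, value)) {
      if (acf.accept(key, value)) {
        return true;
      }
    }
    return false;
  }

  public K createKey() { return delegate.createKey(); }
  public V createValue() { return delegate.createValue(); }
  public long getPos() throws IOException { return delegate.getPos(); }
  public float getProgress() throws IOException { return delegate.getProgress(); }
  public void close() throws IOException { delegate.close(); }
}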

Slide53

System Overview

Slide54

FGAC Predicate Model

The FGAC predicate model consists of a set of subjects, a set of objects, and a set of predicates

Objects are composed of atomic entries

Each predicate independently runs an access control filter (ACF) for each key-value pair

Slide55

Security and privacy policy discussion

Any policy that depends on just one key-value pair could be implemented in this framework

Sanitization

Generalize all zip codes to 3 digits

Delete SSN values

Predicate-based access control, including RBAC

Attribute-based access control based on (key, value) pairs

Cannot enforce policies defined over sets of key-value pairs

K-anonymity cannot be easily enforced

Slide56

In-lined Reference Monitor

ACFs are in-lined into the MapReduce system

Slide57

RecordReader

Slide58

Enhanced RecordReader

Slide59

Enhanced RecordReader

Slide60

Security Discussion / Assumptions

The Java security sandbox prevents unauthorized data accesses

The injected aspects cannot be tampered with, due to private in-lining

The presented reference monitor is small and efficient enough to be subject to formal analysis and tests

Slide61

Experiments

Used a cluster with 14 nodes

Performed experiments with Hadoop 1.1.2 and AspectJ 1.7.3

Created 5 files containing 10M…50M key-value pairs

Each key has a security classification label

Each value consists of a structured part (i.e., a relational table) and an unstructured part (i.e., arbitrary text)

Slide62

ACFs

Many ACFs are generated and tested in experiments

Label ACF: Use the labels in the keys

Name sanitization ACF: Detect and sanitize the American names in the text part

Phone # sanitization ACF: Sanitize possible phone numbers in the text part
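Continuing the hypothetical AccessControlFilter sketch from the high-level-idea slide, a phone-number sanitization ACF might look roughly like this (the regular expression and replacement text are illustrative only):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

class PhoneSanitizationAcf implements AccessControlFilter<LongWritable, Text> {
  // Very rough pattern for US-style phone numbers; a real ACF would be more careful.
  private static final String PHONE = "\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}";

  public boolean accept(LongWritable key, Text value) {
    // Sanitize in place rather than hiding the whole record from the mapper.
    value.set(value.toString().replaceAll(PHONE, "XXX-XXX-XXXX"));
    return true;
  }
}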

Slide63

Scenarios

Each experiment is run on Apache Hadoop

Without any ACF (No ACF)

With ACFs implemented in the source code of Hadoop (integrated ACF)

With ACFs injected via AspectJ (Vigiles ACF)

Slide64

Single-user Experiments

Slide65

Single-user Experiments

Slide66

Single-user Experiments

Slide67

Multi-user Experiments

Slide68

Conclusion

We initially implemented Vigiles on Apache Hadoop, but it can easily be extended to other systems

Vigiles can generate a broad class of safety policies

Initial results indicate that Vigiles exhibits just 1% overhead compared to the implementation that modifies Hadoop's source code

Slide69

Acknowledgement

This research is supported by NSF and Air Force Office of Scientific Research Grants.