Slide1
Cloud Storage Security
Murat Kantarcioglu
Slide2
Main Cloud Security Problems
VM-level attacks
Exploit potential vulnerabilities in hypervisor
Cloud provider vulnerabilities
E.g. cross-site scripting vulnerability in Salesforce.com
Phishing
Integrating cloud authentication with company authentication mechanisms
Availability
Single point of failure
Assurance of computational integrity by the cloud provider
Slide3
Issues with moving data to cloud providers
Will cloud provider fight against a subpoena?
Do you trust Azure logs to show gross negligence on Microsoft's part?
Contractual obligations?
If you could hack one place for espionage, Gmail would be a good starting point
Data lock-in
Slide4
What is new in cloud computing security?
Too big to fail?
What if Amazon hardware is confiscated?
What if Amazon fails?
Hiding activity patterns
Using cloud for crime?
Secure cloud auditing
Mutual auditability
Snowden affair
Slide5
Cloud Computing
Like the Software-as-a-Service and DAS models, it offers many advantages
Better availability
Reduced Costs
Unlimited scalability and elasticity
Cloud Computing
Database
App Server
Code
Email
Multimedia
Slide6
Hybrid Cloud
Integrates local infrastructure with public cloud resources
Extra Advantages
The flexibility of shifting workload to the public cloud when the private cloud is overwhelmed (cloud bursting)
Utilizing in-house resources along with public resources
Cons
Sensitive data exposure
Public cloud resource allocation cost (both storage and computing)
Public / External
Private / Internal
Slide7
Constraints
Data & Computation Partitioning Challenge
s_id | name    | ssn  | dept
-----+---------+------+------
1    | James   | 1234 | CS
2    | Charlie | 4321 | EE
3    | John    | 5645 | CS
4    | Matt    | 8743 | ECON
Q1: SELECT name, ssn FROM Student
Q2: SELECT dept, count(*) FROM Student GROUP BY dept
Student
Q1 contains sensitive information
Q2 execution is more expensive
The ssn column is marked sensitive
How to partition the table?
How to split the computation?
Slide8
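The vertical-partitioning question above can be illustrated in plain Java (a minimal sketch, not the system's actual partitioning algorithm; it treats name and ssn as the sensitive attributes, per Q1, and keeps them on the private cloud):

```java
import java.util.*;

// Illustrative sketch only: vertically partition the Student table so that
// sensitive attributes stay on the private cloud, while the rest can move
// to the public cloud.
public class TablePartitioner {

    // Returns a map with "private" -> sensitive columns (+ key), "public" -> the rest.
    static Map<String, List<String>> partition(List<String> columns,
                                               Set<String> sensitive,
                                               String key) {
        List<String> priv = new ArrayList<>();
        List<String> pub = new ArrayList<>();
        for (String c : columns) {
            if (sensitive.contains(c)) {
                priv.add(c);
            } else {
                pub.add(c);
            }
        }
        // Replicate the key on both sides so the partitions can be re-joined.
        if (!priv.contains(key)) priv.add(0, key);
        if (!pub.contains(key)) pub.add(0, key);
        Map<String, List<String>> result = new LinkedHashMap<>();
        result.put("private", priv);
        result.put("public", pub);
        return result;
    }

    public static void main(String[] args) {
        List<String> student = Arrays.asList("s_id", "name", "ssn", "dept");
        Map<String, List<String>> parts =
            partition(student, new HashSet<>(Arrays.asList("name", "ssn")), "s_id");
        System.out.println(parts);
        // prints {private=[s_id, name, ssn], public=[s_id, dept]}
        // Q1 (name, ssn) now touches only the private partition, while
        // Q2 (the dept aggregation) can run entirely on the public side.
    }
}
```

With this split, the expensive Q2 can be shipped to the public cloud without disclosing any sensitive cells.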
Our Hybrid Cloud Architecture
[Figure: Relations R, queries Q, and constraints C enter through a User Interface Layer; a Data and Query Management Layer, aided by a Statistics Gathering Layer, keeps (R, Q_priv) on the private Hive/Hadoop HDFS cluster and sends (R_pub, Q_pub) to the public one, returning the results for Q_priv and Q_pub.]
Slide9
Design Spectrum
Data Model: Relational, Semi-structured, Key-Value Stores, Text
Sensitivity Model: Attribute Level, Privacy Associations, View-Based
Partitioning Models: Workload Partitioning, Intra-query Parallelism, Dynamic Workload
Minimization Priority: Running Time, Sensitive Data Disclosure, Monetary Cost
Slide10
Outline of Solution
Notation
Formulate Computation Partition Problem (CPP)
Solution to CPP
Experimental Results
Slide11
Notation
sens(R'): The estimated number of sensitive cells in dataset R'
baseTables(q): The estimated minimum set of data items necessary to answer query q Є Q
runT_x(q): The estimated running time of query q Є Q at site x (either public or private)
ORunT(Q', Q''): Overall execution time of queries in Q', given that queries in Q'' are executed on the public cloud
Slide12
Detailed Hybrid Cloud Architecture
[Figure: Relations R, queries Q, and constraints C feed the Data and Query Management Layer, whose Computation Partitioning Module consults the Statistics Gathering Layer (runT_x(q), baseTables(q)), a Monetary Cost Estimator, and a Disclosure Risk Estimator to keep (R, Q_priv) on the private Hive/Hadoop HDFS cluster and route (R_pub, Q_pub) to the public one.]
Slide13
Computation Partitioning Problem (CPP)
Find a subset Q_pub of the given query workload and a subset R_pub of the given dataset, minimizing the overall execution time, where the monetary and disclosure constraints are user-defined
Slide14
Metrics in CPP
Query Execution Time (runT_x(q))
Monetary Costs
stor(R_pub): Storage monetary cost of the public cloud partition
proc(q): Processing monetary cost of a public-side query q
Sensitive Data Disclosure Risk (sens(R_pub))
Estimated number of sensitive cells within R_pub
Slide15
Solution to CPP
CPP can be simplified to only finding Q_pub
Dynamic Programming Approach
CPP(Q, MC, DC) = Q_pub, where Q is the input query set, MC the monetary constraint, DC the disclosure constraint, and Q_pub the output
Slide16
Example
If MC < 25 or DC < 20, q3 can only run on the private side:
CPP({q1, q2, q3}, MC, DC) = CPP({q1, q2}, MC, DC)
Slide17
Example
If q3 can run on both sides:
Case 1: q3 runs on the private side.
CPP({q1, q2, q3}, MC, DC) = CPP({q1, q2}, MC, DC)
Slide18
Example
Case 2: q3 runs on the public side.
CPP(Q, MC, DC) = MIN_TIME(CPP({q1, q2}, j, k) + q3)
where MC-25 ≤ j ≤ MC-15 (bounded by the max/min possible monetary cost of q3) and DC-20 ≤ k ≤ DC-0 (bounded by the max/min possible disclosure risk of q3)
Choose the minimum overall running time between Case 1 and Case 2
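The Case 1 / Case 2 recursion above is essentially a two-budget knapsack. Below is a plain-Java sketch of that dynamic program (illustration only, not the authors' implementation; it assumes integer costs and, for simplicity, models overall time as the sum of per-query times rather than ORunT):

```java
// Two-budget knapsack DP: choose which queries run on the public cloud to
// minimize total time, subject to monetary (MC) and disclosure (DC) budgets.
public class CppDp {

    // privTime[i]/pubTime[i]: estimated runT_priv/runT_pub of query i.
    // mon[i]/disc[i]: monetary cost and disclosure risk if query i runs publicly.
    static int minTime(int[] privTime, int[] pubTime, int[] mon, int[] disc,
                       int MC, int DC) {
        int n = privTime.length;
        // best[j][k]: minimal total time with monetary budget j, disclosure budget k
        int[][] best = new int[MC + 1][DC + 1];
        for (int i = 0; i < n; i++) {
            int[][] next = new int[MC + 1][DC + 1];
            for (int j = 0; j <= MC; j++) {
                for (int k = 0; k <= DC; k++) {
                    // Case 1: query i runs on the private side.
                    int t = best[j][k] + privTime[i];
                    // Case 2: query i runs on the public side, if the budgets allow.
                    if (j >= mon[i] && k >= disc[i]) {
                        t = Math.min(t, best[j - mon[i]][k - disc[i]] + pubTime[i]);
                    }
                    next[j][k] = t;
                }
            }
            best = next;
        }
        return best[MC][DC];
    }

    public static void main(String[] args) {
        int[] privT = {10, 10}, pubT = {2, 3}, mon = {5, 5}, disc = {5, 5};
        // Budgets allow only one query to go public; sending query 0 is best: 2 + 10
        System.out.println(minTime(privT, pubT, mon, disc, 5, 5)); // prints 12
    }
}
```

Loosening the budgets (e.g., MC = DC = 10) lets both queries run publicly, and the DP picks that cheaper schedule automatically.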
Slide19
Experimental Setting
Private Cloud: 14 nodes, located at UTD, Pentium IV, 4 GB RAM, 290-320 GB disk space
Public Cloud: 38 nodes, located at UCI, AMD Dual Core, 8 GB RAM, 631 GB disk space
Hadoop 0.20.2 and Hive 0.7.1
Dataset and Statistic Collection: 100 GB TPC-H data
Query Workload: 40 queries containing modified versions of Q1, Q3, Q6, Q11
Slide20
Experimental Setting
Estimation of Weight (w_x)
Running all 22 TPC-H queries for a 300 GB dataset
w_pub ≈ 40 MB/sec, w_priv ≈ 8 MB/sec
Resource Allocation Cost
Amazon S3 pricing for storage and communication
Storage = $0.140/GB + PUT, Communication = $0.120/GB + GET
PUT = $0.01/1000 requests, GET = $0.01/10000 requests
Amazon EC2 and EMR pricing for processing
$0.085 + $0.015 = $0.1/hour
Sensitivity
Customer: c_name, c_phone, c_address attributes
Lineitem: All attributes in 1-5-10% of tuples
Slide21
Experimental Results
Slide22
Experimental Results
Slide23
Hadoop - Why?
Need to process huge datasets on large clusters of computers
Very expensive to build reliability into each application
Nodes fail every day
Failure is expected, rather than exceptional
The number of nodes in a cluster is not constant
Need a common infrastructure
Efficient, reliable, easy to use
Open-source, Apache-licensed version of the Google File System
Slide24
Who uses Hadoop?
Amazon/A9
Facebook
Google
It has GFS
New York Times
Veoh
Yahoo!
…. many more
Cloudera
Similar to the Red Hat business model
Added services on Hadoop
Slide25
Commodity Hardware
Typically a 2-level architecture
Nodes are commodity PCs
30-40 nodes/rack
Uplink from rack is 3-4 gigabit
Rack-internal is 1 gigabit
Aggregation switch
Rack switch
Slide26
Hadoop Distributed File System (HDFS)
Original Slides by
Dhruba Borthakur
Apache Hadoop Project Management Committee
Slide27
Goals of HDFS
Very Large Distributed File System
10K nodes, 100 million files, 10PB
Yahoo! is working on a version that can scale even further
Assumes Commodity Hardware
Files are replicated to handle hardware failure
Detect failures and recover from them
Optimized for Batch Processing
Data locations exposed so that computations can move to where data resides
Remember: moving large amounts of data is an important bottleneck
Provides very high aggregate bandwidth
Slide28
Distributed File System
Single Namespace for entire cluster
Again this is changing soon!!!
Data Coherency
Write-once-read-many access model
Client can only append to existing files
Files are broken up into blocks
Typically 64MB block size
Each block replicated on multiple DataNodes
Intelligent Client
Client can find location of blocks
Client accesses data directly from DataNode
Slide29
HDFS Architecture
Slide30
MapReduce
Original Slides by
Owen O’Malley (Yahoo!)
&
Christophe Bisciglia, Aaron Kimball & Sierra Michells-Slettvet
Slide31
MapReduce - What?
MapReduce is a programming model for efficient distributed computing
It works like a Unix pipeline
cat input | grep | sort | uniq -c | cat > output
Input | Map |
Shuffle & Sort
| Reduce | Output
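The pipeline analogy above can be simulated on a single machine with the Java standard library alone (a toy sketch with no Hadoop involved — grouping by key plays the role of Shuffle & Sort, counting plays the role of Reduce):

```java
import java.util.*;
import java.util.stream.*;

// A toy, single-machine analogue of the Input | Map | Shuffle & Sort | Reduce
// pipeline: split lines into words (Map), group identical words (Shuffle & Sort),
// and count each group (Reduce).
public class PipelineDemo {

    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
            // Map: emit each word in each line
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            .filter(w -> !w.isEmpty())
            // Shuffle & Sort + Reduce: group identical words and count them
            .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("the cat sat", "the cat")));
        // prints {cat=2, sat=1, the=2}
    }
}
```

The real framework runs the same three stages, but with the Map and Reduce steps distributed over many machines and the grouping done by the shuffle.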
Efficiency from
Streaming through data, reducing seeks
Pipelining
A good fit for a lot of applications
Log processing
Web index building
Slide32
MapReduce - Dataflow
Slide33
MapReduce - Features
Fine grained Map and Reduce tasks
Improved load balancing
Faster recovery from failed tasks
Automatic re-execution on failure
In a large cluster, some nodes are always slow or flaky
Framework re-executes failed tasks
Locality optimizations
With large data, bandwidth to data is a problem
Map-Reduce + HDFS is a very effective solution
Map-Reduce queries HDFS for locations of input data
Map tasks are scheduled close to the inputs when possible
Slide34
Word Count Example
Mapper
Input: value: lines of text of input
Output: key: word, value: 1
Reducer
Input: key: word, value: set of counts
Output: key: word, value: sum
Launching program
Defines this job
Submits job to cluster
Slide35
Word Count Dataflow
Slide36
Word Count Mapper
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Slide37
Word Count Reducer
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Slide38
Word Count Example
Jobs are controlled by configuring
JobConfs
JobConfs are maps from attribute names to string values
The framework defines attributes to control how the job is executed
conf.set("mapred.job.name", "MyApp");
Applications can add arbitrary values to the JobConf
conf.set("my.string", "foo");
conf.setInt("my.integer", 12);
JobConf is available to all tasks
Slide39
Putting it all together
Create a launching program for your application
The launching program configures:
The
Mapper
and
Reducer
to use
The output key and value types (input types are inferred from the InputFormat)
The locations for your input and output
The launching program then submits the job and typically waits for it to complete
Slide40
Putting it all together
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
Slide41
Input and Output Formats
A Map/Reduce job may specify how its input is to be read by specifying an InputFormat to be used
A Map/Reduce job may specify how its output is to be written by specifying an OutputFormat to be used
These default to TextInputFormat and TextOutputFormat, which process line-based text data
Another common choice is SequenceFileInputFormat and SequenceFileOutputFormat for binary data
These are file-based, but they are not required to be
Slide42
Vigiles: Fine-grained Access Control for MapReduce Systems **
Murat Kantarcioglu
Joint work with Huseyin Ulusoy, Erman Pattuk, Kevin Hamlen
Slide43
Motivation
MapReduce systems have become popular for processing big data (e.g., Hadoop)
Initially, security and privacy were not the primary concerns of MapReduce systems
New challenges due to high volume and high variety
Need to support relational, semi-structured and unstructured data
Slide44
Does NoSQL mean No Security??
Slide45
Motivation
Many security applications are being developed for different needs
Apache Accumulo allows multi-level access control for key-value stores
Apache Knox allows authorization, authentication, and auditing capabilities
Apache Argus tries to achieve "centralized security administration to manage all security related tasks in a central UI"
"Fine"-grained access control for HDFS (file level), Hive (column level) and HBase (column level)
Slide46
Why file level access is not fine-grained enough?
A given HDFS file may have significant sensitive information
A log file for a web service supporting multiple companies contains billions of records from different users
Allowing one user to see all the info in a given file makes it easier to abuse and/or misuse the data
No need for an employee to see company X's logs if he/she is working on company Y's account
The fine-grained access control need is addressed in relational database management systems
Slide47
View-based access control in RDBMSs
Define a view containing the predicates to select the tuples to be returned to a given subject S
Grant S the SELECT privilege on the view, and not on the underlying table
Example:
CREATE VIEW CompanyXLog AS SELECT * FROM Log WHERE Log.Company = 'X';
GRANT SELECT ON CompanyXLog TO Ann;
Slide48
Our Previous Work: Access control for Hive**
** Thuraisingham et al. Secure data storage and retrieval in the cloud. CollaborateCom 2010: 1-8
Slide49
Comparison to Apache Sentry
Apache Sentry allows view definition at the column level
Our HiveAC framework allows arbitrary view definition based on any predicate
Developed in late 2009; we did not release an open-source version
Slide50
Fine-grained access control for non-relational data
Fine-grained access control is not supported for generic MapReduce computation
Example:
Slide51
Goal
Propose a fine-grained access control mechanism for the generic MapReduce model
Provide a modular architecture to be independent of the underlying MapReduce version
Incur minimal overhead
Have an easy-to-use and expressive security policy specification
Slide52
High Level Idea
Apply a predicate to all generated/read key-value pairs before the map functions process them
Slide53
System Overview
Slide54
FGAC Predicate Model
The FGAC predicate model consists of a set of subjects, a set of objects, and a set of predicates
Objects are composed of atomic entries
Each predicate independently runs an access control filter (ACF) for each key-value pair
Slide55
Security and privacy policy discussion
Any policy that is dependent on just one key-value pair could be implemented in this framework
Sanitization
Generalize all zip codes to 3 digits
Delete SSN values
Predicate-based access control including RBAC
Attribute-based access control based on key-value pairs
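Such single-record policies can be sketched in plain Java (illustrative only, not the Vigiles implementation; the record shape and regexes are assumptions): each ACF either rewrites a key-value pair or drops it before a map function would ever see it.

```java
import java.util.*;
import java.util.function.*;

// Toy model of per-record access control filters (ACFs): a chain of
// predicate/transformers applied to every key-value pair ahead of map.
public class AcfDemo {

    // An ACF either rewrites a record or drops it (returns null).
    interface Acf extends UnaryOperator<Map.Entry<String, String>> {}

    // Sanitization: generalize a 5-digit zip code in the value to its first 3 digits.
    static Acf zipGeneralizer() {
        return e -> Map.entry(e.getKey(),
                e.getValue().replaceAll("\\b(\\d{3})\\d{2}\\b", "$1**"));
    }

    // Sanitization: delete records whose value contains an SSN-like pattern.
    static Acf ssnDropper() {
        return e -> e.getValue().matches(".*\\d{3}-\\d{2}-\\d{4}.*") ? null : e;
    }

    // Run every record through the ACF chain before "map" sees it.
    static List<Map.Entry<String, String>> filter(List<Map.Entry<String, String>> records,
                                                  List<Acf> acfs) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, String> r : records) {
            for (Acf acf : acfs) {
                if (r == null) break;   // record was dropped by an earlier ACF
                r = acf.apply(r);
            }
            if (r != null) out.add(r);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        records.add(Map.entry("r1", "zip 75080"));
        records.add(Map.entry("r2", "ssn 123-45-6789"));
        System.out.println(filter(records, Arrays.asList(zipGeneralizer(), ssnDropper())));
        // prints [r1=zip 750**]
    }
}
```

Because each ACF sees exactly one key-value pair, this style of filter can express the sanitization and RBAC-like policies above, but not set-level properties such as k-anonymity.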
Cannot enforce policies defined over a set of key-value pairs
K-anonymity cannot be easily enforced
Slide56
In-lined Reference Monitor
ACFs are in-lined into the MapReduce system
Slide57
RecordReader
Slide58
Enhanced RecordReader
Slide59
Enhanced RecordReader
Slide60
Security Discussion / Assumptions
The Java Security sandbox prevents unauthorized data accesses
The injected aspects cannot be tampered with, due to private in-lining
The presented reference monitor is small and efficient enough to be subject to formal analysis and tests
Slide61
Experiments
Used a cluster with 14 nodes
Performed experiments with Hadoop 1.1.2 and AspectJ 1.7.3
Created 5 files containing 10M…50M key-value pairs
Each key has a security classification label
Each value consists of a structured part (i.e., a relational table) and an unstructured part (i.e., arbitrary text)
Slide62
ACFs
Many ACFs are generated and tested in experiments
Label ACF: Use the labels in the keys
Name sanitization ACF: Detect and sanitize the American names in the text part
Phone # sanitization ACF: Sanitize possible phone numbers in the text part
Slide63
Scenarios
Each experiment is run on Apache Hadoop
Without any ACF (No ACF)
With ACFs implemented in the source code of Hadoop (integrated ACF)
With ACFs injected via AspectJ (Vigiles ACF)
Slide64
Single-user Experiments
Slide65
Single-user Experiments
Slide66
Single-user Experiments
Slide67
Multi-user Experiments
Slide68
Conclusion
We initially implemented Vigiles on Apache Hadoop, but it can easily be extended to other systems
Vigiles can enforce a broad class of safety policies
Initial results indicate that Vigiles exhibits just 1% overhead compared to an implementation that modifies Hadoop's source code
Slide69
Acknowledgement
This research is supported by NSF and Air Force Office of Scientific Research Grants.