The Big Data Network (phase 2) Cloud Hadoop system



Presentation to the Manchester Data Science club, 14 July 2016. Peter Smyth, UK Data Service. The challenges of building and populating a secure but accessible big data environment for researchers in the Social Sciences and related disciplines.




Presentation Transcript


The Big Data Network (phase 2) Cloud Hadoop system

Presentation to the Manchester Data Science club

14 July 2016

Peter Smyth

UK Data Service

The challenges of building and populating a secure but accessible big data environment for researchers in the Social Sciences and related disciplines.

The BDN2 Cloud Hadoop system

Overview of this presentation

Aims

A simple Hadoop System

Dealing with the data

Processing

Users

Safeguarding the data and its usage

Appeal for data and use cases

AIMS

Aims

Provide a processing environment for big data

Targeted at the Social Sciences, but not exclusively so

Provide easy ingest of datasets

Provide comprehensive search facilities

Provide easy access to users, for processing or download

Cloud Hadoop System

Cloud Hadoop System

Start with minimal configuration

Cloud based, so we can grow it as needed

Adding nodes is what Hadoop is good at

Need to provide HA from the outset

Resilience and user access are important

Search facilities will be expected 24/7

Software installed - and how we will use it

Standard HDP (Hortonworks Data Platform)

Spark, Hive, Pig, Ambari, Zeppelin, etc.

Other Apache software

Ranger – monitors and manages comprehensive data security across the Hadoop platform

Knox – a REST API gateway providing a single point of entry to the cluster

Other Software

Kerberos

AD integration

Our own processes for workflows and ingest / metadata production

Fitting the bits together

Hadoop System

Users

Data

Job Scheduling

User Access control & quotas

Performance Monitoring

Auditing and Logging

Data Access control & SDC

Data

Getting the data in

Large datasets from 3rd parties

Existing UKDS datasets

Not necessarily big data

But likely to be used in conjunction with other data

BYOD – Bring your own data
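The slides do not say how deposited files are validated on arrival. As a purely hypothetical illustration, one common ingest check is verifying a depositor-supplied checksum (the manifest field here is an assumption, not a documented UKDS mechanism):

```python
import hashlib

def verify_ingest(payload: bytes, expected_sha256: str) -> bool:
    """Check a deposited file against the depositor-supplied SHA-256
    digest (a hypothetical manifest field, for illustration only)."""
    return hashlib.sha256(payload).hexdigest() == expected_sha256

payload = b"age,income\n34,21000\n"
digest = hashlib.sha256(payload).hexdigest()

intact = verify_ingest(payload, digest)          # file arrived whole
corrupt = verify_ingest(payload + b"x", digest)  # reject and re-fetch
```

The same check works for one-off third-party deposits and for BYOD uploads.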

How not to do it!

HDF – Hortonworks Data Flow

Built on Apache NiFi

Allows workflows to be built for collecting data from external sources

Single shot datasets

Regular updates (monthly, daily)

Possibility of streaming data
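NiFi expresses ingest as a graph of processors that pass "flowfiles" between them. The following toy Python analogue only sketches that shape; the flow itself is invented, and the NiFi processors named in the comments are just plausible counterparts:

```python
import hashlib

# Toy analogue of a NiFi flow: each function plays the role of one
# processor, passing a record (a "flowfile") down the chain.
def fetch(url):
    # In NiFi, a source processor such as InvokeHTTP or GetFile.
    return {"source": url, "body": b"raw,data\n"}

def add_checksum(rec):
    # NiFi's content-hashing processors play a similar role.
    rec["sha256"] = hashlib.sha256(rec["body"]).hexdigest()
    return rec

def land(rec):
    # In NiFi, a sink processor such as PutHDFS.
    rec["landed"] = True
    return rec

def run_flow(url, steps=(add_checksum, land)):
    rec = fetch(url)
    for step in steps:
        rec = step(rec)
    return rec

record = run_flow("https://example.org/monthly-update.csv")
```

A single-shot dataset runs the flow once; a monthly or daily update re-runs the same flow on a schedule, which is exactly what NiFi's scheduling handles.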

NiFi workflow

Data storage

Raw data

Metadata (semantic data lake)

Dashboards, summaries and samples

User data

Own datasets

Work in progress

Results

Semantic data lake

Must contain everything

There will be only one search engine

Whether in the cloud or on-prem (secure data)

The metadata isn’t just what is extracted from the datasets and associated documentation

Appropriate ontologies need to be used

Not only terms but relationships between them
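Capturing relationships, not just terms, is the difference between a keyword index and a semantic one. A minimal sketch of the idea, holding metadata as subject–predicate–object triples (the dataset names and predicates are invented examples, not UKDS ontology terms):

```python
# Metadata held as subject-predicate-object triples, so the search
# engine can follow relationships between terms rather than only
# matching them. All names below are invented for illustration.
triples = {
    ("Unemployment2015", "hasTopic", "LabourMarket"),
    ("LabourMarket", "narrower", "YouthUnemployment"),
    ("Unemployment2015", "coverage", "UK"),
}

def related(term):
    """Terms linked to `term` by any predicate, in either direction."""
    out = set()
    for s, _p, o in triples:
        if s == term:
            out.add(o)
        elif o == term:
            out.add(s)
    return out

# A search for "LabourMarket" can now surface a dataset whose own
# documentation never mentions that term.
hits = related("LabourMarket")
```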

Processing

Processing

Ingest and curation processing

Extracting and creating Metadata

Processing for Dashboards, summaries and samples

Samples – in advance or as requested?

User searches

User jobs

Processing systems

Spark

Hive / Pig

Effect of interactive environments

Zeppelin
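For readers unfamiliar with why Spark fits this workload: jobs are written as a map step applied to each record independently (so the cluster can parallelise it across partitions) followed by a reduce step that combines partial results. A plain-Python sketch of that pattern, no cluster required:

```python
from collections import Counter
from functools import reduce

# The shape of a typical Spark job, in plain Python: map each record
# independently (the parallelisable part), then reduce the partials.
records = ["hadoop spark", "spark hive", "hadoop"]

mapped = [Counter(line.split()) for line in records]    # map step
totals = reduce(lambda a, b: a + b, mapped, Counter())  # reduce step
```

On the real system, Spark (or Hive/Pig compiling to similar plans) distributes the map step across nodes; the sketch only shows the programming model.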

Job Scheduling

Ingest-related jobs

Metadata maintenance related jobs

User jobs

Batch?

Hive

Pig

(Near) real time

Spark Streaming

What kind of delay is acceptable?

For users

For Operations

Do we need to prioritise?
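One possible answer to the prioritisation question is a simple priority queue in which operational work (ingest, metadata maintenance) outranks user batch jobs. The policy and the job names below are assumptions for illustration, not anything stated in the presentation:

```python
import heapq

# Assumed policy: ingest jobs run first, then metadata maintenance,
# then user batch jobs. Lower number = higher priority.
PRIORITY = {"ingest": 0, "metadata": 1, "user-batch": 2}

queue = []
for name, kind in [("hive-report", "user-batch"),
                   ("load-census", "ingest"),
                   ("reindex-lake", "metadata")]:
    heapq.heappush(queue, (PRIORITY[kind], name))

# Jobs come off the queue in priority order, not submission order.
run_order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
```

In practice YARN queues would enforce this kind of policy on the cluster itself; the sketch only illustrates the ordering decision.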

Users

User types

Short term (try before you ‘buy’)

Long term (researchers, 3–5 years)

Commercial users? (in exchange for data)

Everyone is a search user

Safeguarding Data

Security and Audit

Who can access what data

Making data available

Disc quotas

Private areas

Who has access to resources and can run jobs

Sandbox area for authenticated users

Providing tools

Levels of support

What audit trails are maintained

What is recorded

How long do we keep the logs

Will they be reviewed?

Data Ownership and Provenance

Restrictions on use of a dataset

Licence agreements

Types of research permitted

Complications due to combining

Permissions needed

Carrying the provenance/licence with the data in the semantic data lake

SDC – Statistical Disclosure Control

Currently a manual process

Likely to be more complex as more datasets are combined

Could just be checked on output

Automated tools are becoming available

But how good are they? Or, are they good enough?
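One rule that automated SDC tools commonly apply when checking outputs is threshold (small-cell) suppression: any table cell counting fewer than n individuals is suppressed before release. A minimal sketch of that rule; the threshold of 10 and the area codes are illustrative, not UKDS policy:

```python
# Threshold (small-cell) suppression: suppress any cell counting
# fewer than THRESHOLD people before an output leaves the system.
# The value 10 is illustrative only.
THRESHOLD = 10

def suppress_small_cells(table, threshold=THRESHOLD):
    """Replace counts below `threshold` with None before release."""
    return {cell: (count if count >= threshold else None)
            for cell, count in table.items()}

released = suppress_small_cells({"LS1": 250, "LS2": 3, "LS3": 47})
```

Real SDC is harder than this single rule suggests (secondary suppression, dominance checks, and risks from combined datasets), which is the slide's point about whether the tools are good enough.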

Hadoop in the Cloud

Performance monitoring

Need to understand usage patterns

Or try to anticipate them

Need to be able to detect when the system is under stress, and react in a timely manner

CPU

RAM

HDFS

Need to provide proper job scheduling for true batch jobs

Cannot allow the use of Spark to result in a free-for-all
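Detecting stress in time to react comes down to comparing sampled metrics (CPU, RAM, HDFS usage, as listed above) against alert limits. A sketch of that check; the metric names and limit values are illustrative, not taken from Ambari's actual alert definitions:

```python
# Compare a metrics sample against alert limits and report which
# resources are under stress. Names and limits are illustrative.
LIMITS = {"cpu_pct": 85.0, "ram_pct": 90.0, "hdfs_used_pct": 80.0}

def stressed(sample, limits=LIMITS):
    """Return, sorted, the metrics whose sampled value exceeds its limit."""
    return sorted(m for m, v in sample.items() if v > limits.get(m, 100.0))

alerts = stressed({"cpu_pct": 92.5, "ram_pct": 40.0, "hdfs_used_pct": 81.0})
```

On the real cluster, Ambari's metrics and alerting would play this role; the value of the sketch is the decision rule, which then feeds the reaction (e.g. adding nodes, the elasticity discussed on the next slide).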

Pros and Cons of the Cloud for Hadoop

Pros

Elasticity

Add or remove nodes as required

Only pay for what you use

Cons

Hadoop was designed as a share-nothing system

Adding, and particularly removing, nodes is not as straightforward as in other types of cloud system

Continuously paying for storage of big datasets

Appeal for use cases

Why we need data and use cases

Building a generalised system

Many of the processes and procedures have not been tried before

Need an understanding of ‘typical’ use needs

Need to ensure we cater for end-to-end processing of user needs

What is in it for you

Safe 24/7 repository for your data

Access to Big Data processing

Support & Training

Peter Smyth

Peter.smyth@manchester.ac.uk

ukdataservice.ac.uk/help/

Subscribe to the UK Data Service news list at https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=UKDATASERVICE

Follow us on Twitter https://twitter.com/UKDataService

or Facebook https://www.facebook.com/UKDataService

and offers of data