The Big Data Network (phase 2) Cloud Hadoop system
Presentation to the Manchester Data Science club
14 July 2016
Peter Smyth
UK Data Service

The challenges of building and populating a secure but accessible big data environment for researchers in the Social Sciences and related disciplines.
The BDN2 Cloud Hadoop system

Overview of this presentation
Aims
A simple Hadoop System
Dealing with the data
Processing
Users
Safeguarding the data and its usage
Appeal for data and use cases

AIMS

Aims
Provide a processing environment for big data
Targeted at the Social Sciences - but not exclusively so
Provide easy ingest of datasets
Provide comprehensive search facilities
Provide easy access to users, for processing or download

Cloud Hadoop System

Cloud Hadoop System
Start with minimal configuration
Cloud based, so we can grow it as needed
Adding nodes is what Hadoop is good at
Need to provide HA (high availability) from the outset
Resilience and user access are important
Search facilities will be expected 24/7

Software installed - and how we will use it
Standard HDP (Hortonworks Data Platform)
Spark, Hive, Pig, Ambari, Zeppelin, etc.
Other Apache software
Ranger – monitors and manages comprehensive data security across the Hadoop platform
Knox – a REST API gateway providing a single point of entry to the cluster (see the sketch below)
Other Software
Kerberos
AD (Active Directory) integration
Our own processes for workflows and ingest / metadata production

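As a rough illustration of Knox as the single point of entry, the Python sketch below lists an HDFS directory over WebHDFS through a Knox gateway. The gateway host, topology name, path and credentials are all invented placeholders; in this setup Knox would check the credentials against AD.

```python
# Hedged sketch: listing an HDFS directory through a Knox gateway.
# Gateway host, topology ("default"), path and credentials are placeholders.
import requests

GATEWAY = "https://knox.example.ac.uk:8443/gateway/default"

resp = requests.get(
    GATEWAY + "/webhdfs/v1/data/raw?op=LISTSTATUS",
    auth=("some_ad_user", "password"),  # Knox checks this against AD/LDAP
    verify=False,  # a real deployment would verify the gateway's TLS cert
)
resp.raise_for_status()
for status in resp.json()["FileStatuses"]["FileStatus"]:
    print(status["pathSuffix"], status["type"])
```
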
Fitting the bits together
Hadoop System
Users
Data
Job Scheduling
User Access control & quotas
Performance Monitoring
Auditing and Logging
Data Access control & SDC

Data

Getting the data in
Large datasets from 3rd parties
Existing UKDS datasets
Not necessarily big data
But likely to be used in conjunction with other data
BYOD – Bring your own data

How not to do it!

HDF – Hortonworks Data Flow
Built on Apache NiFi
Allows workflows to be built for collecting data from external sources
Single shot datasets
Regular updates (monthly, daily)
Possibility of streaming data

NiFi workflow

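NiFi flows are assembled visually rather than written as code, but the shape of a simple scheduled ingest flow, collecting files from a provider's drop area and landing them in HDFS, can be sketched in plain Python. This is an illustration of the pattern only, not NiFi itself; every path, host and port below is invented.

```python
# Illustrative stand-in for a basic NiFi ingest flow: poll a drop
# directory and land each file in HDFS via WebHDFS. All names invented.
import os
import requests

DROP_DIR = "/incoming/provider_x"
WEBHDFS = "http://namenode.example.ac.uk:50070/webhdfs/v1"

def ingest_once():
    for name in os.listdir(DROP_DIR):
        local_path = os.path.join(DROP_DIR, name)
        create_url = "{0}/data/raw/{1}?op=CREATE&overwrite=false".format(WEBHDFS, name)
        # WebHDFS two-step create: the NameNode redirects to a DataNode,
        # then the file body is PUT to that location.
        redirect = requests.put(create_url, allow_redirects=False)
        with open(local_path, "rb") as f:
            requests.put(redirect.headers["Location"], data=f).raise_for_status()
        print("ingested", name)

ingest_once()  # NiFi would schedule this (single shot, daily, monthly...)
```
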
Data storage
Raw Data
Metadata (Semantic Data Lake)
Dashboards, summaries and samples
User data
Own datasets
Work in progress
Results

Semantic data lake
Must contain everything
There will be only one search engine
Whether in the cloud or on-prem (secure data)
The metadata isn’t just what is extracted from the datasets and associated documentation
Appropriate ontologies need to be used
Not only terms but relationships between them

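As a sketch of what "not only terms but relationships" means, the fragment below uses rdflib to record that smart-meter readings are a kind of energy-consumption data, so a search on the broader term also finds the narrower dataset. All URIs and dataset names are invented.

```python
# Sketch: metadata as terms *and* relationships, using rdflib.
# All URIs and names are invented for illustration.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

UKDS = Namespace("http://example.org/ukds/")
g = Graph()

g.add((UKDS.EnergyConsumption, RDF.type, RDFS.Class))
g.add((UKDS.SmartMeterReading, RDFS.subClassOf, UKDS.EnergyConsumption))
g.add((UKDS.dataset42, UKDS.measures, UKDS.SmartMeterReading))
g.add((UKDS.dataset42, RDFS.label, Literal("Household smart meter trial")))

# Search on the relationship, not just the literal term: datasets that
# measure anything that is (transitively) a kind of EnergyConsumption.
results = g.query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ukds: <http://example.org/ukds/>
    SELECT ?d WHERE {
        ?d ukds:measures ?t .
        ?t rdfs:subClassOf* ukds:EnergyConsumption .
    }
""")
for row in results:
    print(row.d)
```
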
Processing

Processing
Ingest and curation processing
Extracting and creating Metadata
Processing for Dashboards, summaries and samples
Samples – in advance or as requested?
User searches
User jobs
Processing systems
Spark
Hive / Pig
Effect of interactive environments
Zeppelin

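To make the processing side concrete, here is a minimal sketch of the kind of Spark job a user (or a Zeppelin notebook) might run against the cluster; the file path and column names are invented.

```python
# Minimal sketch of a user job: summarising a large dataset with Spark.
# Path and column names are invented for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bdn2-summary-sketch").getOrCreate()

df = spark.read.csv("hdfs:///data/raw/energy_usage.csv",
                    header=True, inferSchema=True)

# The kind of aggregate that could feed a dashboard or a sample extract.
summary = df.groupBy("region").agg({"kwh": "avg", "household_id": "count"})
summary.show()

spark.stop()
```
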
Job Scheduling
Ingest related jobs
Metadata maintenance related jobs
User jobs
Batch?
Hive / Pig
(Near) Real time
Spark Streaming
What kind of delay is acceptable?
For users
For Operations
Do we need to prioritise?

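The acceptable-delay question has one very concrete setting in Spark Streaming: the micro-batch interval. A minimal sketch, with an invented source host and port:

```python
# Sketch: in Spark Streaming the batch interval fixes the minimum delay,
# so "(near) real time" is a design decision. Source host/port invented.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="bdn2-streaming-sketch")
ssc = StreamingContext(sc, batchDuration=30)  # 30 s batches: near real time

lines = ssc.socketTextStream("feed.example.ac.uk", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # results appear once per 30-second batch

ssc.start()
ssc.awaitTermination()
```

A 30-second interval already rules out sub-second answers, so the choice of delay is as much policy as engineering.
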
Users

User types
Short term (try before you ‘buy’)
Long term (researchers, 3-5 years)
Commercial users? (in exchange for data)
Everyone is a search user

Safeguarding Data

Security and Audit
Who can access what data
Making data available
Disc quotas
Private areas
Who has access to resources and can run jobs
Sandbox area for authenticated users
Providing tools
Levels of Support
What audit trails are maintained
What is recorded
How long do we keep the logs?
Will they be reviewed?

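On whether the logs will be reviewed: one modest answer is a periodic script over the NameNode audit log. The sketch below assumes the standard key=value layout of HDFS audit lines; the log path is illustrative.

```python
# Sketch: a periodic review of the HDFS audit log, counting denied
# accesses per user and command. Log path is illustrative.
import re
from collections import Counter

denied = Counter()
with open("/var/log/hadoop/hdfs-audit.log") as log:
    for line in log:
        fields = dict(re.findall(r"(\w+)=(\S+)", line))
        if fields.get("allowed") == "false":
            denied[(fields.get("ugi"), fields.get("cmd"))] += 1

for (user, cmd), count in denied.most_common(10):
    print(user, cmd, count)
```
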
Data Ownership and Provenance
Restrictions on use of a dataset
Licence agreements
Types of research permitted
Complications due to combining
Permissions needed
Carrying the provenance/licence with the data in the semantic data lake

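A sketch of carrying the licence and provenance with the data: a few Dublin Core statements attached to a dataset record in the lake (rdflib again; all URIs and names invented).

```python
# Sketch: licence and provenance recorded alongside the dataset itself,
# using Dublin Core terms. URIs and names are invented.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

UKDS = Namespace("http://example.org/ukds/")
g = Graph()
ds = UKDS.dataset42

g.add((ds, DCTERMS.license,
       URIRef("http://example.org/licences/academic-use-only")))
g.add((ds, DCTERMS.provenance,
       Literal("Supplied by provider X under agreement 2016-03")))
g.add((ds, DCTERMS.source, UKDS.provider_x))

# Any job touching dataset42 can first check what its licence permits.
for licence in g.objects(ds, DCTERMS.license):
    print("licence:", licence)
```
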
SDC – Statistical Disclosure Control
Currently a manual process
Likely to be more complex as more datasets are combined
Could just be checked on output
Automated tools are becoming available
But how good are they? Or rather, are they good enough?

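To give a feel for what an automated tool checks, and why it may still not be good enough, the simplest SDC rule is primary suppression of small table cells. A sketch with an invented threshold of 10:

```python
# Sketch: the simplest automated SDC rule, suppressing table cells whose
# counts fall below a threshold. Threshold and data are invented.
import pandas as pd

THRESHOLD = 10

def suppress_small_cells(table):
    """Blank out counts below the threshold so small groups are not disclosed."""
    return table.mask(table < THRESHOLD)

counts = pd.DataFrame({"region_a": [120, 4], "region_b": [57, 11]},
                      index=["employed", "unemployed"])
print(suppress_small_cells(counts))  # the 4 becomes NaN (suppressed)
```

Real SDC also needs secondary suppression (otherwise marginal totals reveal the blanked cell), and it gets harder still as datasets are combined.
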
Hadoop in the Cloud

Performance monitoring
Need to understand usage patterns
Or try to anticipate them
Need to be able to detect when the system is under stress - and be able to react in a timely manner
CPU
RAM
HDFS
Need to provide proper job scheduling for true batch jobs
Cannot allow the use of Spark to result in a free-for-all

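One way to detect stress is to poll Ambari's REST API for cluster metrics. The sketch below is indicative only: the cluster name, credentials and exact metric field names are placeholders and vary between Ambari versions.

```python
# Hedged sketch: polling Ambari's REST API for cluster metrics. Cluster
# name, credentials and metric paths are placeholders and will differ
# between deployments and Ambari versions.
import requests

AMBARI = "http://ambari.example.ac.uk:8080/api/v1/clusters/bdn2"

resp = requests.get(AMBARI + "?fields=metrics/cpu,metrics/memory",
                    auth=("admin", "password"))
resp.raise_for_status()
metrics = resp.json().get("metrics", {})

cpu_idle = metrics.get("cpu", {}).get("Idle")
if cpu_idle is not None and cpu_idle < 10:
    print("CPU under stress: only {0}% idle".format(cpu_idle))
```
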
Pros and Cons of the Cloud for Hadoop
Pros
Elasticity
Add or remove nodes as required
Only pay for what you use
Cons
Hadoop was designed as a share-nothing system
Adding, and particularly removing, nodes is not as straightforward as in other types of cloud system
Continuously paying for storage of big datasets

Appeal for use cases

Why we need data and use cases
Building a generalised system
Many of the processes and procedures have not been tried before
Need an understanding of ‘typical’ user needs
Need to ensure we cater for end-to-end processing of those needs

What is in it for you
Safe 24/7 repository for your data
Access to Big Data processing
Support & Training

Peter Smyth
Peter.smyth@manchester.ac.uk
ukdataservice.ac.uk/help/
Subscribe to the UK Data Service news list at https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=UKDATASERVICE
Follow us on Twitter https://twitter.com/UKDataService
or Facebook https://www.facebook.com/UKDataService
Offers of data are very welcome!