Hadoop @ eBay: Past, Present and Future


Slide1

Hadoop @ eBay: Past, Present and Future

Ryan Hennig
Hadoop Platform Team

Slide2

ABOUT ME

Slide3

RYAN HENNIG

- Born and raised in Seattle, WA
- Studied Computer Science at University of Washington in Seattle
- Worked on Microsoft SQL Server 2006 – 2012
  - Shipped SQL Server 2008, 2008 R2, 2012
- Joined eBay Hadoop team in early 2012
  - Based in Bellevue, suburb of Seattle

COMPUTE AND DATA INFRASTRUCTURE

Slide4

AGENDA

- Past: Growth of Hadoop at eBay
- Present: Hadoop Use Cases, Operations Tools
- Future: Hadoop 2.0

Slide5

HADOOP AT EBAY: PAST

- Growth of Hadoop at eBay
- Adventures in Forking
- Partnership with Hortonworks

Slide6

HADOOP EVOLUTION @ eBay

2007: Single digit nodes
2009: Search; 10s nodes
2010: Shared cluster; 100s nodes; 1000s+ core; PB; CDH2
2011: Shared clusters; 1000s node; 10,000+ core; 10s PB; Wilma (0.20)
2012: Shared clusters; 1000s node; 10,000+ core; 10s PB; Argon (0.22)
2013: Shared clusters; 4k+ node; 40,000+ core; 50s PB; HDP 1.x

Slide7

ADVENTURES IN FORKING

- 2007-2010: eBay runs shared clusters on Cloudera Distribution of Hadoop
- 2010-2012: eBay runs shared clusters on custom Hadoop versions
  - 2010: Wilma (based on 0.20)
  - 2011: Argon (based on 0.22)
  - 2012: Custom branch abandoned

Lessons Learned

- Forking a fast-changing open source project is difficult and risky
- Balancing development and operations needs
- Development team size
  - Facebook had 100
  - eBay had 15
- Coordination with open source community = lots of overhead
- Divergence from open source: push changes early and often

Slide8

Slide9

EBAY AND HORTONWORKS

2012: eBay enters partnership with Hortonworks

Goals

- Focus on eBay-specific development internally
- Leverage Hortonworks expertise for general Hadoop development
- Avoid source code divergence by making open source contribution a priority

Benefits to Hortonworks

- Credibility enhanced by having a well-known customer
- Ability to test at large scale

Slide10

HADOOP AT EBAY: PRESENT

- Shared and Dedicated Clusters
- Job Distribution
- Use Case Examples
- eBay Data Platform Overview

Slide11

SHARED AND DEDICATED CLUSTERS

Shared clusters

- 10s of PB and 10s of thousands of slots per cluster
- Used primarily for analytics of user behavior and inventory
- Mix of production and ad-hoc jobs
- Mix of MR, Hive, Pig, Cascading, etc.
- Hadoop and HBase security enabled

Dedicated clusters

- Very specific use cases like index building
- Tight SLAs for jobs (on the order of minutes)
- Immediate revenue impact
- Usually smaller than our shared clusters, but still big (100s of nodes…)

Slide12

JOB DISTRIBUTION BY TYPE

Slide13

USE CASE EXAMPLES

Cassini, eBay’s new search engine:

- Use MR to build full and incremental near-real-time indexes
- Raw data is stored in HBase for efficient updates and random reads
- Strong SLAs: < 10 minutes
- Run on dedicated clusters

Related and similar items recommendations:

- Use transactional data, click stream data, search index, etc.
- Production MR jobs on a shared cluster

Analytics dashboard:

- Run Mobius MR jobs to join click stream data and transactional data
- Store summary data in HBase
- Web application to query HBase
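The analytics dashboard use case joins click stream data with transactional data in MapReduce. The talk does not show the Mobius API, so the following is only a plain-Python sketch of a reduce-side join that mimics the map, shuffle, and reduce steps in memory. The record fields (`user_id`, `url`, `amount`) and the sample data are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample records; the real schemas are not given in the talk.
CLICKS = [
    {"user_id": "u1", "url": "/item/42"},
    {"user_id": "u1", "url": "/item/77"},
    {"user_id": "u2", "url": "/item/42"},
]
TRANSACTIONS = [
    {"user_id": "u1", "amount": 19.99},
    {"user_id": "u3", "amount": 5.00},
]


def map_phase(records, source_tag, key_field):
    """Emit (join_key, (source_tag, record)) pairs, as a mapper would."""
    for record in records:
        yield record[key_field], (source_tag, record)


def shuffle(pairs):
    """Group mapper output by key, standing in for the MR shuffle/sort."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_phase(key, tagged_records):
    """Inner-join the click and transaction records that share one key."""
    clicks = [r for tag, r in tagged_records if tag == "click"]
    txns = [r for tag, r in tagged_records if tag == "txn"]
    for click, txn in product(clicks, txns):
        yield {"user_id": key, "url": click["url"], "amount": txn["amount"]}


def join_clicks_with_transactions(clicks, transactions):
    """Run the three phases in memory and return the joined rows."""
    pairs = list(map_phase(clicks, "click", "user_id"))
    pairs += list(map_phase(transactions, "txn", "user_id"))
    joined = []
    for key, tagged in sorted(shuffle(pairs).items()):
        joined.extend(reduce_phase(key, tagged))
    return joined


if __name__ == "__main__":
    for row in join_clicks_with_transactions(CLICKS, TRANSACTIONS):
        print(row)
```

Tagging each record with its source lets a single reducer see both sides of the join for one key. Users with clicks but no transactions (or the reverse) produce no output, which is inner-join behavior.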

Slide14

HADOOP OPERATIONS

LDAP Integration

- All users stored in Active Directory, accessed via LDAP
- Access to MapReduce Queues granted via MapReduce queues
- Batch users: shared by a group of users

Security

- Kerberos as implemented by Microsoft Active Directory
- One domain for users, another for service/server principals
- Batch users authenticated via keytabs, not passwords

Misc

- 10s of slave nodes are broken at any given time
- Often need to add several racks of machines at a time

Slide15

HADOOP OPERATIONS

Team has Development and Operations Responsibilities

- 2 huge shared clusters
- 1800+ users, exponential growth
- About 10 Hadoop developers
- Recently: operations work moved to dedicated team

Developed several tools to manage operations

- Hadoop Management Console: user-facing web app
- ldap-admin: swiss-army knife style tool for hadoop admins
- Puppet: for adding machines to the clusters, many racks at a time
- Decom/Recom scripts: automatic detection, repair, decommission, and recommission of slave nodes
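The Decom/Recom scripts are described only as automatic detection, repair, decommission, and recommission of slave nodes. One way to picture that loop is a small per-node state machine, sketched below. The state names, the retry limit, and the `is_healthy` / `try_repair` callbacks are all assumptions made for illustration, not details of eBay's scripts.

```python
from enum import Enum


class NodeState(Enum):
    IN_SERVICE = "in_service"
    DECOMMISSIONED = "decommissioned"


MAX_REPAIR_ATTEMPTS = 2  # assumption: the talk gives no retry policy


def reconcile(node, is_healthy, try_repair):
    """Advance one node through detect -> repair -> decom/recom.

    node: dict with "state" (NodeState) and "repair_attempts" (int).
    is_healthy / try_repair: callables taking the node, returning bool.
    Returns the action taken, as a string.
    """
    healthy = is_healthy(node)
    if node["state"] is NodeState.IN_SERVICE:
        if healthy:
            return "none"
        if node["repair_attempts"] < MAX_REPAIR_ATTEMPTS:
            node["repair_attempts"] += 1
            if try_repair(node):
                node["repair_attempts"] = 0
                return "repaired"
            return "repair_failed"
        # Repairs are exhausted, so take the node out of the cluster.
        node["state"] = NodeState.DECOMMISSIONED
        return "decommissioned"
    # Decommissioned nodes come back once they pass the health check again.
    if healthy:
        node["state"] = NodeState.IN_SERVICE
        node["repair_attempts"] = 0
        return "recommissioned"
    return "none"
```

Calling `reconcile` once per node on a schedule keeps the policy in one place. With 10s of nodes broken at any given time, the point of such a loop is that no human has to make the decommission decision by hand.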

Slide16

HADOOP MANAGEMENT CONSOLE

- Custom web application built on Ruby on Rails
- Self-service tools are continually added to reduce support load

User Management

- Access Requests
- Group Membership

Batch User Management

- New Requests
- Sudoer management

Dataset Management

- Explore Datasets
- Request new dataset transfer between Teradata and Hadoop

Metadata tools

- Each dataset is stored in custom XML format
- Code Generation: Hive Tables, Java POJOs
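The metadata tools generate Hive tables from a per-dataset XML description. The custom XML format is not shown in the talk, so the sketch below invents a minimal one (a `dataset` element with `field` children) purely to illustrate the code-generation idea. Element names, attributes, and the sample dataset are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical dataset description; the real eBay XML schema is not public.
DATASET_XML = """
<dataset name="clickstream_daily" location="/data/clickstream/daily">
  <field name="user_id" type="string"/>
  <field name="item_id" type="bigint"/>
  <field name="event_ts" type="timestamp"/>
</dataset>
"""


def hive_ddl_from_xml(xml_text):
    """Render a CREATE EXTERNAL TABLE statement from a dataset description."""
    root = ET.fromstring(xml_text)
    columns = ",\n".join(
        f"  `{field.get('name')}` {field.get('type').upper()}"
        for field in root.findall("field")
    )
    return (
        f"CREATE EXTERNAL TABLE `{root.get('name')}` (\n"
        f"{columns}\n"
        f")\n"
        f"LOCATION '{root.get('location')}';"
    )


if __name__ == "__main__":
    print(hive_ddl_from_xml(DATASET_XML))
```

Keeping one description per dataset and generating the Hive DDL from it means the table definition cannot drift from the metadata. A Java POJO generator could walk the same `field` elements.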

Slide17

Slide18

Slide19

Slide20

Slide21

Slide22

Slide23

Slide24

ldap-admin

- Command-line tool written in Ruby
- Swiss-army knife tool, features added on demand for support issues

Often used features:

- Add a user to a group
- View key details for LDAP users and groups
- List all users, batch users, hadoop groups
- Reset batch user passwords and keytabs
- Show/add/remove sudoers for a batch account
- Run user diagnostics: check permissions, keytabs, etc.

Slide25

HADOOP AT EBAY: FUTURE

- HDFS Federation
- YARN
- New Scenarios
- Storage and Operational Efficiency

Slide26

HDFS HA and Federation

HDFS High-Availability for Reliability

- NameNode in Hadoop 1.0 is a Single Point of Failure
- Automated failover to hot standby
- Depends on ZooKeeper

HDFS Federation for Scalability and Isolation

- Hadoop 1.0: Single NameNode service
  - “Secondary NameNode” is not for failover
  - Storage scales horizontally, but Namespace scales vertically
  - No isolation for different tenants or applications
- Hadoop 2.0: HDFS Federation
  - Partition the HDFS Namespace
  - Many independent NameNodes
  - Allows direct access to Block Storage w/o going through HDFS interface
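Partitioning the namespace across independent NameNodes means a client needs some way to decide which NameNode owns a given path. Hadoop provides client-side mount tables (ViewFs) for this. The sketch below is not that API, only a toy longest-prefix lookup to show the idea. The `/users` and `/reports` split follows the example given later in the deck, while the host names are made up.

```python
# Hypothetical mount table: path prefix -> NameNode that owns that subtree.
MOUNT_TABLE = {
    "/users": "nn1.example.com",
    "/reports": "nn2.example.com",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the NameNode owning `path`, using longest-prefix match."""
    best = None
    for prefix in mount_table:
        # Match whole path components only, so "/usersx" is not under "/users".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no NameNode is mounted for {path}")
    return mount_table[best]
```

Under this table, `resolve_namenode("/users/alice/data")` returns `nn1.example.com`. Each NameNode only holds metadata for its own subtree, which is what lets the namespace scale horizontally and isolates tenants from one another.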

Slide27

HDFS HA

Slide28

HDFS HA

Slide29

HDFS HA

Slide30

HDFS Federation

Horizontal Scalability of HDFS Namespace

- Multiple independent NameNodes, each serving a subtree of the namespace
- Example: NN1 provides /users, NN2 provides /reports

Slide31

YARN

Hadoop 1.0: MapReduce

- JobTracker and TaskTracker services
- Handles Resource Management, Job Execution

Hadoop 2.0: YARN

- Refactoring responsibilities of JobTracker and TaskTracker into more general platform
- Global ResourceManager
  - Cluster-wide resource management
- Per-application ApplicationMaster
  - Application-specific job control

Slide32

YARN

Slide33

YARN

Slide34

YARN

Slide35

YARN

Slide36

New Scenarios

Iterative Query

- Stinger (Hive), Impala, etc.
- Rapid data exploration and analysis

Graph Databases

- TitanDB, Giraph
- Billions of vertices and edges
- Complex graph traversals
- Applications: PayPal fraud detection, Social Graph Analysis

Real-Time Processing

- Storm (Twitter), Apache S4
- Reinforcement Learning, Monitoring

Slide37

Efficiency and Reliability

Storage Efficiency

- HDFS introduces a 3x storage cost for its replicas
- HDFS-RAID: more reliability for 1.5x storage cost
  - Reed-Solomon
  - Locally Repairable Codes (Project Xorbas)
- Tradeoff: the cost of repairing lost data is much higher
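The storage and repair tradeoff above can be made concrete with a little arithmetic. The sketch below compares n-way replication with a Reed-Solomon stripe of k data blocks and m parity blocks. The (10, 5) parameters are an assumption picked only because they reproduce the 1.5x figure. The talk does not say which code parameters eBay was considering.

```python
def replication_profile(copies):
    """Storage overhead, tolerated losses, and repair reads for n-way replication."""
    return {
        "overhead": float(copies),
        "tolerated_losses": copies - 1,
        "blocks_read_per_repair": 1,
    }


def reed_solomon_profile(data_blocks, parity_blocks):
    """Same metrics for a Reed-Solomon stripe of k data and m parity blocks."""
    return {
        "overhead": (data_blocks + parity_blocks) / data_blocks,
        "tolerated_losses": parity_blocks,
        "blocks_read_per_repair": data_blocks,
    }


if __name__ == "__main__":
    print("3x replication:", replication_profile(3))
    # Illustrative parameters only: 10 data + 5 parity blocks gives 1.5x.
    print("RS(10, 5):     ", reed_solomon_profile(10, 5))
```

With these illustrative numbers the coded stripe survives more lost blocks at half the storage cost, but rebuilding a single lost block means reading 10 surviving blocks instead of copying 1. Locally Repairable Codes aim to reduce that repair traffic by adding some extra parity.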

Operational Efficiency

More automation

More self-service tools

Better Monitoring

Slide38

Open Source

HMC Metadata

- Long term goal: standardize on open source technologies (HCatalog)
- Short term: explore what should be open sourced

Hadoop Management Console

- Hadoop Access Request Automation
- Batch user creation and management
- Metadata management
- Code generation of dataset to Hive tables and Java POJOs

ldap-admin tools

- Very useful but tightly coupled to eBay’s LDAP configuration
- Willing to open source if there is interest

Slide39

THANK YOU

Questions?