SCALING FEATURE GENERATION

Uploaded On 2021-01-05

FROM PROTOTYPING TO PRODUCTION AT REWE
REWE Systems GmbH | March 2019 | Benjamin Greve


Presentation Transcript

SCALING FEATURE GENERATION
FROM PROTOTYPING TO PRODUCTION AT REWE
REWE Systems GmbH | March 2019 | Benjamin Greve

AGENDA
1 / Introduction
2 / Example Project: Predicting Brand Market Fit
3 / Feature Generation in Prototyping – The Problems
4 / Moving From Prototyping to Production – Lessons Learned

INTRODUCTION / ME
Benjamin Greve
Data Scientist at REWE Systems
• Mathematician
• Working in data science projects for 5+ years (2 years at REWE)
• Likes hiking, climbing, dancing, cooking (especially the eating part)

INTRODUCTION / REWE GROUP
Source: https://www.rewe-group-geschaeftsbericht.de, 14.03.2019
REWE Group facts (2017)
• 15,300 markets
• 57.8 billion euro total revenue
• 345,000 employees

INTRODUCTION / REWE SYSTEMS
• 1 Bil. data sets every day
• 30 locations
• 200,000 users
• 30,000 cash registers
• 1,200 IT specialists
• 7,500 markets

EXAMPLE PROJECT: PREDICTING BRAND MARKET FIT
We want to predict how well a brand will be received in a particular market to assist category managers in selecting the right brands for each market.

BRAND / CATEGORY: A set of products that form a logical group. Can contain between one and a few thousand products.

FEATURES x1, x2, …
• Popularity of wider category in market
• Number of competing articles
• Location of market, incl. demographical information
• …

Example values: y = 1.13, y = 0.92, y = 1.27 (scale marked 0 and 1, labelled "more popular" and "less popular")

Technical definition
TARGET VARIABLE y:

Brand popularity in current market compared to average across all REWE markets

FEATURE GENERATION IS A CENTRAL PART OF THE DATA SCIENCE WORKFLOW
CRISP-DM: Cross-industry standard process for data mining
Diagram labels: Data Warehouse, ?, Feature Generation, Predictive model, Predictions
Feature generation: "Everything you do with the data before you apply a predictive model"
Image source: https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining#/media/File:CRISP-DM_Process_Diagram.png

A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT
Prototyping performed by Data Scientists
OUTPUT #1: Very long SQL script
• 2000+ lines of SQL
• 40+ intermediate tables
• interdependencies
• inconsistent naming, chaotic code style
• not optimized for performance
• not robust
→ not production-ready or scalable
OUTPUT #2: Very long R or Python script
• Different topic for another talk.

AVOID LONG, MONOLITHIC SQL SCRIPTS
Weaknesses:
• Duplicate/redundant SQL for every feature
• Parameters hidden within the scripts
• Adding new features is hard due to interdependencies
Diagram: one very long SQL script → training data table; two very long, redundant SQL scripts → training data table, scoring data table

USE A MODULAR FEATURE GENERATION INSTEAD
Strengths:
✓ Highly modular and scalable
✓ Parameters are centrally defined
✓ Feature scripts are mostly independent, making it easy to add new features
Input script: creates the input table, i.e. the table of all market-brand combinations for which to calculate features
Derived helper tables:
• Distinct list of brands
• Mapping brands to articles
• …
Feature script A, Feature script B, Feature script C → merge features into feature store

Feature Store
Market    Brand  Feature A1  Feature A2  Feature B1  …
10000001  20001
10000002  20001
10000003  20001

DECOMPOSING COMPLEX SQL WITHIN KNIME LEADS TO COMPLEX WORKFLOWS
✓ Visual, KNIME native
• Gets too complex as SQL logic grows
• Hard to translate into standard data warehouse ETLs

✓ Code
✓ Powerful version control (releases)
✓ Analysts fluent in SQL
• Harder to explain/show
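The modular layout from the USE A MODULAR FEATURE GENERATION INSTEAD slide (input script, independent feature scripts, merge into a feature store) can be sketched roughly as follows. This is only an illustration, not the REWE implementation: SQLite stands in for the data warehouse, and the sales table, the params table, the feature names and the sample rows are all hypothetical. Keeping parameters in a table is one assumed way of defining them centrally.

```python
import sqlite3

# Hypothetical source data and a central parameter table (assumptions for
# illustration; the talk does not show the real schema).
SETUP_SQL = """
CREATE TABLE sales (market INTEGER, brand INTEGER, article INTEGER, quantity INTEGER);
INSERT INTO sales VALUES
    (10000001, 20001, 1, 5), (10000001, 20001, 2, 3),
    (10000002, 20001, 1, 4), (10000002, 20002, 7, 9);
CREATE TABLE params (name TEXT PRIMARY KEY, value INTEGER);
INSERT INTO params VALUES ('min_quantity', 5);
"""

# Input script: all market-brand combinations for which to calculate features.
INPUT_SQL = "CREATE TABLE input_table AS SELECT DISTINCT market, brand FROM sales;"

# Feature scripts: each reads only the input table, the source data and the
# central parameters, and writes one table keyed by (market, brand).
FEATURE_SCRIPTS = {
    "feature_a": """
        CREATE TABLE feature_a AS
        SELECT i.market, i.brand,
               SUM(s.quantity) AS a1_total_quantity,
               COUNT(DISTINCT s.article) AS a2_num_articles
        FROM input_table i
        JOIN sales s ON s.market = i.market AND s.brand = i.brand
        GROUP BY i.market, i.brand;
    """,
    "feature_b": """
        CREATE TABLE feature_b AS
        SELECT i.market, i.brand, COUNT(s.article) AS b1_large_sales
        FROM input_table i
        LEFT JOIN sales s
          ON s.market = i.market AND s.brand = i.brand
         AND s.quantity >= (SELECT value FROM params WHERE name = 'min_quantity')
        GROUP BY i.market, i.brand;
    """,
}


def build_feature_store(conn):
    """Run the input script, every feature script, then the merge step."""
    conn.executescript(SETUP_SQL)
    conn.executescript(INPUT_SQL)
    for sql in FEATURE_SCRIPTS.values():
        conn.executescript(sql)
    # Merge: left-join each feature table onto the input table.
    joins = " ".join(
        f"LEFT JOIN {name} USING (market, brand)" for name in FEATURE_SCRIPTS
    )
    conn.execute(f"CREATE TABLE feature_store AS SELECT * FROM input_table {joins}")
    query = "SELECT * FROM feature_store ORDER BY market, brand"
    return conn.execute(query).fetchall()


rows = build_feature_store(sqlite3.connect(":memory:"))
for row in rows:
    print(row)
```

Adding a feature in this sketch means adding one entry to FEATURE_SCRIPTS; the input script and the merge step stay unchanged, which is the independence the slide lists as a strength.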

Example workflow from KNIME Examples server

A DEPLOYMENT WORKFLOW PULLS SQL SCRIPTS FROM VERSION CONTROL TO KNIME SERVER
A fully automated deployment workflow copies snapshot or release versions from the version control system to the KNIME Server.
Download SQL files from version control system

IT'S VERY SIMPLE TO EXECUTE THE DEPLOYED SQL FILES
• To perform Feature Generation, execute a set of SQL scripts in a given order
• Just write down the names of the files to be executed in the correct order and run the workflow

HELPFUL METANODES EXAMPLE: GET DATABASE CONNECTION
Useful metanodes help to implement a staging concept, e.g.:
Select staging environment → obtain connection to the corresponding database

NEXT STEPS
Currently:
• specify scripts manually
• order manually to take dependencies into account
Next level: Feature dependency graph
• Automatic dependency resolution
• Already built into KNIME!
• specify features in arbitrary order

THANKS
Thanks for your
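The step from a manually ordered script list to the automatic dependency resolution named under NEXT STEPS can be illustrated outside KNIME with a topological sort. The sketch below uses Python's standard-library graphlib (Python 3.9+) as a stand-in; it does not show KNIME's built-in mechanism, and the script names and the dependency map are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical script names. Each script maps to the scripts it depends on;
# the entries can be listed in any order.
DEPENDENCIES = {
    "merge_feature_store.sql": {"feature_a.sql", "feature_b.sql"},
    "feature_b.sql": {"input_table.sql", "helper_brand_articles.sql"},
    "feature_a.sql": {"input_table.sql"},
    "helper_brand_articles.sql": {"input_table.sql"},
    "input_table.sql": set(),
}


def execution_order(dependencies):
    """Return an order in which every script runs after its dependencies.

    TopologicalSorter raises graphlib.CycleError on circular dependencies.
    """
    return list(TopologicalSorter(dependencies).static_order())


def run_in_order(order, run_script):
    """Call run_script once per script name, in the given order."""
    for name in order:
        run_script(name)


order = execution_order(DEPENDENCIES)
executed = []
# A real runner would execute the SQL file; here the name is only recorded.
run_in_order(order, executed.append)
print(order)
```

With the "currently" approach, the list passed to run_in_order would be written by hand; with a dependency map, only the direct prerequisites of each script need to be declared.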