Tecton
Kevin Stumpf - co-founder and CTO of Tecton; ex-Uber; creator of the Michelangelo platform, which helped Uber operationalize ML
Dispatcher - MBA from Stanford; CS degree from Harvard
Analytic Systems - typically driven by a human; run in batch (daily/nightly) across data in the data warehouse and data lake
Operational Systems - ML application makes automated decisions to change the user experience (product recommendations, real time pricing, personalization) - data from Data warehouse, data lake, AND streaming data
Michelangelo - started 2015 - platform for production ML at Uber; centralized place for ML - training, evaluation, pushing to production, serving in production - could go end to end in a couple of hours
After this - looked at what an ML application consists of - the application that wants to make a prediction, the ML model, and the data you need to make the prediction
All 3 of these need to work in tandem
Features - the data fed into the ML model to make a prediction; raw data is e.g. a transaction log (all orders a customer placed with Uber Eats, clicks on restaurants); need to turn this into high-fidelity signals that can be used to make a prediction
Uber was good at going from development to production and from model training to serving, but less so at going from feature engineering to feature serving
Typically - a data scientist works in Jupyter notebooks with a raw data dump; does some Python-based data wrangling to aggregate it into signals; then trains the model in the notebook. When they get something that works, you need to get the model into production and get the Python transformations into production as well - they need to run in real time, so you would typically have to reimplement them
Feature Store - interface between raw data and models - sits on top of streaming data (kinesis/kafka) and batch data (snowflake), transforms raw data into features and serves them at low latency to your model running in production; serves historical feature values to model training pipelines using a large scale batch interface
Model Serving interface - only cares about most recent values;
Model Training interface - says what happened in the past
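A minimal sketch of these two interfaces using the open-source Feast SDK (discussed later in these notes), assuming a feature repo is already configured; the feature view, feature, and entity names are made up, and argument names vary a bit across Feast versions:

from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path=".")  # assumes a configured Feast feature repo in the current directory

# Serving interface - only the freshest value per entity, low latency
online_features = store.get_online_features(
    features=["restaurant_orders:trailing_30min_order_count"],  # illustrative feature reference
    entity_rows=[{"restaurant_id": "rest_42"}],
).to_dict()

# Training interface - historical values, one row per (entity, timestamp) in the spine
entity_df = pd.DataFrame({
    "restaurant_id": ["rest_42", "rest_43"],
    "event_timestamp": pd.to_datetime(["2021-06-01 12:00", "2021-06-02 18:30"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["restaurant_orders:trailing_30min_order_count"],
).to_df()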
Data scientist is primary person using this feature store
Technical Problems of a Feature Store
Features come from different data sources - different storage characteristics, freshness, types of transformations; prediction request data (the exact data you have right before making a prediction)
Data warehouse might have complete history, streams normally 7-14 days
Data Freshness - data warehouse tends to be less fresh; data warehouse supports batch aggregations, time-window aggregations, and row-level transformations
How do you calculate the feature and hand it to your model if that model has really high latency requirements?
Might not be able to run lifetime aggregation on data warehouse
Need to add a cache and decouple the feature calculation from where queries run; then settle on the freshness/cost-efficiency tradeoff of your features - very fresh features mean constantly running ETL to pre-compute them, while optimizing for cost means features may be calculated from stale data
Adding in multiple data sources that arrive at different times (stream, batch) makes it even harder to get everything on the same page
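A minimal sketch of that decoupling, assuming a local Redis as the cache; the key name and the warehouse query helper are hypothetical. A scheduled job pre-computes the aggregate and writes it to the cache, and the serving path only does a cheap key lookup - how often the job runs is exactly the freshness vs. cost-efficiency knob:

import redis

cache = redis.Redis(host="localhost", port=6379)

def run_expensive_warehouse_query():
    # Stand-in for the real lifetime aggregation against the warehouse (hypothetical)
    return 1234

def precompute_job():
    # Run on a schedule (e.g. every N minutes): do the expensive aggregation,
    # then write the result into the cache
    cache.set("user:u_1:lifetime_order_count", run_expensive_warehouse_query())

def get_feature_for_serving():
    # Production request path: a single low-latency lookup, never the warehouse
    value = cache.get("user:u_1:lifetime_order_count")
    return int(value) if value is not None else None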
Training serving skew - feature calculations differ slightly for production serving vs. training purposes;
Training/Production Discrepancy
Feature on uber eats - most frequently ordered cuisine
New customer on uber eats - should that feature be null?
Whether it is null or a default value, you need to train your model and push it to production with consistency in how the feature is treated
You may be training inconsistently
Fix - common training / serving implementation
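A minimal sketch of that fix: one shared implementation of the "most frequently ordered cuisine" feature, used by both the training pipeline and the serving path, so the null handling for new customers cannot drift between the two (names are illustrative):

from collections import Counter
from typing import List, Optional

def most_frequent_cuisine(order_history: List[str]) -> Optional[str]:
    # New customers with no orders get an explicit None - and the model is
    # trained with exactly the same convention
    if not order_history:
        return None
    return Counter(order_history).most_common(1)[0][0]

# Training pipeline: applied to historical order logs per customer
train_value = most_frequent_cuisine(["thai", "pizza", "thai"])  # -> "thai"
# Serving path: applied to the same customer's live order history
serve_value = most_frequent_cuisine([])  # -> None for a new customer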
Timing discrepancies
Feature - how fast will your food get to you
Data - trailing 30min order count from a restaurant
If you compute that feature for training on a 5 min lag, you get a different distribution of order counts than the real-time value
Then when the model sees real-time production data, the distributions don't match and it gives back wrong answers
Data Leakage Issue
Including information in your training data set that you wouldn't know in a production system
Ex: How long will it take to travel from Palo Alto to SF; we have data on historical trips;
Feature: Is there an accident on the highway (boolean)
Might have a pretty accurate historical model;
When that model is in production, it can't know whether a crash will happen after the trip has started
Using that future information in training is data leakage
Fix - instead use the historical distribution of accident likelihood on the way to SF
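A sketch of how a point-in-time correct join avoids this kind of leakage, using pandas; column names are illustrative. Each trip only gets the accident signal as it was known at departure time, never information from after the trip started:

import pandas as pd

trips = pd.DataFrame({
    "trip_id": [1, 2],
    "departure_time": pd.to_datetime(["2021-06-01 08:00", "2021-06-01 09:30"]),
})

# Feature values stamped with the time at which each value became known
accident_feature = pd.DataFrame({
    "feature_time": pd.to_datetime(["2021-06-01 07:45", "2021-06-01 09:00"]),
    "accident_on_101": [False, True],
})

# merge_asof picks, for each trip, the latest feature value with
# feature_time <= departure_time - i.e. what was actually known at prediction time
training_df = pd.merge_asof(
    trips.sort_values("departure_time"),
    accident_feature.sort_values("feature_time"),
    left_on="departure_time",
    right_on="feature_time",
    direction="backward",
)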
Feature Store solves these problems
Problem 2 - ML teams are stuck building complex data pipelines
Data scientist asks data engineer to build transformations - data engineer is BUSY
Feature store lets the data scientist self-serve - define and ship features on their own through the feature store connection
Problem 3 - No standardization
Different data engineers duplicate work - unnecessarily computing the same feature values
Feature store - manages all of these features - can be browsed, shared, discovered between teams
Problem 4 - Data issues break models in production
Data issues break ML models in production - an upstream data source could be broken (e.g. a Kafka broker is down); feature drift happens as the world changes - you would want to look at the statistics of the features you used and see how they have changed since the model went into production
Opaque subpopulation outages in features
You may look globally at the performance of the model and the accuracy of its predictions - but global prediction accuracy might not show that predictions in Germany or France are not working whatsoever, because those subpopulations are not big enough to hurt your global accuracy - and it could be just one feature that's broken for that subpopulation
Unclear data ownership - who owns what feature - who fixes it when it breaks
Feature store - monitors your data
Components of Feature Store
Registry - browsing for features (effectively a data catalog)
Serving System - API running in production that serves features to your models
Monitoring - alerts you about features
Transformations - runs feature transformations against batch stores
Storage - stores the features
Tecton does not come with a built-in transformation system - Tecton plugs into your existing data platform (e.g. AWS EMR running Spark or Flink)
Data storage - done in an S3 bucket or a Snowflake setup
Online serving - stored in a Redis or DynamoDB cluster
Demo
Define features as code - in a file system, backed by a git repository; use Python code or PySpark
Specify -
Generating training data - use the Python Tecton SDK to fetch historical feature values for training
Fetching - fetch a feature vector at low latency using python-Tecton
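A rough sketch of that SDK workflow - the method and argument names below approximate the Tecton Python SDK as shown in the demo and may not match the current SDK exactly; the workspace and feature service names are made up:

import tecton
import pandas as pd

ws = tecton.get_workspace("prod")
fs = ws.get_feature_service("restaurant_recommendations")  # hypothetical feature service name

# Training: pass a spine dataframe of entities + timestamps and get back the
# feature values that would have been served at those times
spine = pd.DataFrame({
    "user_id": ["u_1", "u_2"],
    "timestamp": pd.to_datetime(["2021-06-01", "2021-06-02"]),
})
training_df = fs.get_historical_features(spine, timestamp_key="timestamp").to_pandas()

# Serving: fetch the freshest feature vector for one entity at low latency
feature_vector = fs.get_online_features(join_keys={"user_id": "u_1"}).to_dict()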
How does it help my team
Build more accurate models; consistency between training and serving
Helps get models into production faster
Brings DevOps like practices to feature development
Atlassian - production customer in 2020; went from months to days for deploying models
Able to train new models faster
Model accuracy up by 20%
Feast - open source - self managed - originally created by GoJek - Tecton contributes to it;
Tecton - fully managed cloud service
Demo
User Interface - view of all the features you have defined; can see what transformations are being run to transform raw data into features
Monitoring info - how fresh is the data, information about the lineage/flow of this data; how its being stored
Quick glance at summary statistics of the actual feature values
Feature services - a grouping of features, used to serve features to a specific model
Write a transformation in Python - need to put this into Tecton
Go into the Tecton repository - create a new file for the new transformation - write the same query that we wrote in Databricks
Then add code that creates the feature - the materialization config says build a feature pipeline that writes into the online and offline feature stores and backfills data (see the sketch after the parameters below)
Online_enabled - for model serving
Offline_enabled - for historical access
Feature_start_time - when to start using data from (where backfills begin)
Schedule_interval - e.g. 30 days; how often new jobs run to re-calculate the feature
Serving_ttl - how long a computed feature value stays valid for serving (time-to-live in the online store)
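Roughly what the feature definition from the demo looks like - the decorator and argument names below mirror the parameters noted above but are an approximation of the Tecton SDK (they differ across SDK versions); the batch source and entity are assumed to be defined elsewhere in the feature repo:

from datetime import datetime, timedelta
from tecton import batch_feature_view  # assumed import path

@batch_feature_view(
    sources=[transactions_batch],              # assumed: batch source defined elsewhere in the repo
    entities=[user],                           # assumed: entity defined elsewhere in the repo
    mode="spark_sql",
    online=True,                               # "Online_enabled" - materialize for model serving
    offline=True,                              # "Offline_enabled" - materialize for historical access
    feature_start_time=datetime(2021, 1, 1),   # "Feature_start_time" - where backfills begin
    batch_schedule=timedelta(days=30),         # "Schedule_interval" - how often to re-compute
    ttl=timedelta(days=30),                    # "Serving_ttl" - how long a value stays servable
)
def user_transaction_features(transactions):
    return f"""
        SELECT user_id, amount, timestamp
        FROM {transactions}
    """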
Then update feature service (the bundle of features grouped for a specific type of model) to include the new feature; call it v2; creates a variant of this feature service
Tecton auto-schedules jobs to start calculating the feature
Pass a dataframe to the Tecton feature service - it joins the features onto that dataframe; in the background it figures out what feature values would have been served for each individual at that point in time
Dataframe then can be used to train a model;
Once the model is in production, you hit a single REST endpoint - backed by the same feature service where you got your training data; fetch the features from that one service, then run the model
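A generic sketch of that production path - the URL, payload shape, and header below are placeholders rather than Tecton's documented API; the point is that one endpoint serves the same feature vector the model was trained against:

import requests

resp = requests.post(
    "https://<your-cluster>.tecton.ai/api/v1/feature-service/get-features",  # placeholder URL
    headers={"Authorization": "Tecton-key <API_KEY>"},                       # placeholder auth header
    json={
        "params": {
            "workspace_name": "prod",
            "feature_service_name": "restaurant_recommendations",  # hypothetical service name
            "join_key_map": {"user_id": "u_1"},
        }
    },
)
feature_vector = resp.json()
# prediction = model.predict(to_model_input(feature_vector))  # then run the model on it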
How are you managing the storage for low-latency serving?
Keeping fresh values of features - in a key-value store for low latency lookups
Feast is using Redis
Tecton uses DynamoDB; for most use cases it works well
Optimizations for things like aggregations to do really quick lookups for traditionally more complicated features
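One common way such aggregations are made cheap to serve (a general "tiling" sketch, not necessarily Tecton's exact design): pre-compute partial counts per small time tile, then sum only the tiles covering the trailing window at read time instead of scanning raw events:

from collections import defaultdict

# Pre-computed partial aggregates: (restaurant_id, tile_start_minute) -> order count
tiles = defaultdict(int)
tiles[("rest_42", 0)] = 3    # orders in minutes 0-5
tiles[("rest_42", 5)] = 7    # orders in minutes 5-10
tiles[("rest_42", 10)] = 2   # orders in minutes 10-15

def trailing_order_count(restaurant_id, now_minute, window_minutes=30, tile_size=5):
    # Sum the handful of tiles covering the trailing window - a few key lookups
    # instead of scanning every raw order event
    start = now_minute - window_minutes
    return sum(
        tiles[(restaurant_id, t)]
        for t in range(start - start % tile_size, now_minute, tile_size)
    )

print(trailing_order_count("rest_42", now_minute=15))  # -> 12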
DynamoDB is an AWS-specific service - how much of this is customizable based on end-user needs?
Tecton today runs on AWS; built in a cloud-agnostic way
Builds adapters for cloud-vendor-specific services - Delta Lake, BigQuery, Snowflake
Is this service serverless/auto-scaled?
For the pipeline transformation service, Tecton spins up an ephemeral EMR cluster per feature; you can limit the max number of EMR clusters you want to spin up
Serving side - Tecton runs its own Kubernetes cluster - you can see the latency distribution and memory utilization; when scale gets to where it needs a new pod, autoscaling EKS handles it
Redis vs. Dynamo
Redis is good if the total set of feature values you want fits in memory, or when the same value is fetched repeatedly and can be served from memory
DynamoDB - if you have a huge cardinality of feature values, you want Dynamo; cost - Dynamo charges per write and read, so you incur cost every time you fetch the same feature value again; DAX, an accelerator on top of Dynamo, is supported
How can I implement this in Azure?
Not there yet
Model Monitoring - features may drift - a Grafana dashboard shows metrics - how do you capture feature drift?
Native integration with Great Expectations - set quality expectations and get alerted when feature data shifts away from them
Fiddler - putting together a joint solution; the two fit together really well - features from the feature store can be sent to Fiddler for monitoring
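A sketch of the kind of expectation you would declare for a feature, using Great Expectations' pandas convenience API (the exact API varies by GE version, and Tecton's native integration wires this up declaratively rather than by hand); the feature name and thresholds are illustrative:

import great_expectations as ge
import pandas as pd

feature_df = ge.from_pandas(pd.DataFrame({
    "trailing_30min_order_count": [4, 7, 2, 9, 5],
}))

# Expectations over freshly materialized feature values; a distribution drifting
# outside the band seen at training time fails the check and can trigger an alert
range_check = feature_df.expect_column_values_to_be_between(
    "trailing_30min_order_count", min_value=0, max_value=500)
mean_check = feature_df.expect_column_mean_to_be_between(
    "trailing_30min_order_count", min_value=1, max_value=50)

print(range_check.success, mean_check.success)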
TZ Questions
Where does the actual model live?
How hard is it to integrate everything upfront?
Do devs/data scientists skip a lot of this stuff?
Do you have to copy and paste the transformation from databricks?
If you have a bunch of data scientists using this - does it get flooded/messy?
Do people actually define their feature upfront correctly?
How often really are models or features re-used? The idea being that BI visualizations answer single questions and are rarely re-used by a new data analyst; how can we measure how frequently features are being reused?
Competitors
Feast (open source)
Hopsworks (has own feature store)
Databricks (has its own)
Iguazio (has its own)
Rasgo
Kaskada (specializes on event data)
Scribble Data (data prep - Enrich is product)
AWS (has its own as part of SageMaker)
Vertex AI (GCP) (has its own feature store part of Vertex)
Molecula (FeatureBase is product)
Continual (SQL centric feature store)
Feast new release about 8 months ago
Cloud data warehouses and delta lakes - modern data platforms are finally offering teams ways to centralize data and re-purpose it in ways that are really good for analytics, and have made it self-serve
Unlocked insane productivity for data analysts; but haven't really done a ton for ML
Ideally we want to use the same higher-value data that our analysts have cleaned in our ML models; but the data needed for predictions might not be available, or might not match the data in the data warehouse
Data consistency issues (i.e. data is already transformed) or data security issues and access
Serving data in production - you could use a data warehouse for that, but a data warehouse isn't really built for real-time serving
ML teams build painful workarounds
Re-build the offline system online - use some streaming system or an online version of it; error prone though - data now goes through two transformation pipelines; data scientists end up not owning their work in production because engineers have to be involved in setting up the complex pipelines
Feature Store - hub of data flow for ML applications; consistent transforms, handle online vs. offline, organization, simple and fast workflow
Need to serve features in production; then you also should be able to ask for a training data set
Feature store connected to stream sources, batch sources, third party data sources
Capabilities
Serving - delivering feature data to your model; consistent for training and serving
Store - online - contains freshest value of each feature; offline - contains all historical values of features - can go back in time for training; organized in a nice way for ease of joining features together
Transformations - orchestrate transformations over spark cluster; runs pre-computations, also does smart backfills on features
Monitoring - making sure quality is good, features are up to date; serving latencies, feature computation
Discovery - registry - single source of truth for features within an organization
"Feature store becomes data catalog of production ready signals"
Features have a small footprint - the feature store runs on top of snowflake
Feature stores are incremental and not all or nothing - start incrementally and then connect existing pipelines to feature stores
Willem Pienaar - creator of Feast and Tech Lead at Tecton
GCP-focused - BigQuery as the offline store, Firestore as the online store (maybe Firebase)
Feast can stitch columns together; ensures point in time correctness that prevents feature leakage
Bucket and a serverless firestore
The whole idea was to extend Michelangelo to data management - the data management side of ML; really helped teams get into production quickly
Really helped with model sharing and setup