Data Catalog
Alation Overview
https://www.youtube.com/watch?v=sPqeMCvW8TE&t=61s&ab_channel=GreatDataMinds
Repository of metadata; helps with data governance, collaboration, analysis
Contains data sets, reports, and queries covering all info stored in a data lake
Glossaries, lineage
Helps find information and understand whether it's stale or not; the goal is a single source of truth
Data literacy - proper interpretation
Data governance - responsibility, authority
Why now?
Data explosion
Changing workforce
Evolving data privacy laws
Use cases
Analytics, governance, cloud migration, data privacy/GDPR, risk and compliance, digital transformation
American Family Insurance is a big customer
Catalogs
Follow
Alation
Universal search bar
Ability to find information, star information, watch information; functional resources
Query - look for assets: tables, columns, schemas, BI reports; conversations from users in the catalog; focused on data analyst productivity initially
Allows users to deprecate reports; gives a steward ownership of the data; gives specific lineage on where data is coming from and going to
Warnings - tell the user how to use the data and whether it may be distributed
Lineage - use it to reverse engineer where the data is coming from; can see who the top users are for an individual data set; stewards are able to govern the data for HIPAA/PII
Queries become assets in the catalog; the integration goes all the way through to the SQL IDE, so warnings are carried directly into the IDE; joins and filters are also captured in the data catalog
See most frequently queried columns; good for knowing which columns to migrate
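A minimal sketch of the "most frequently queried columns" idea, assuming the columns touched by each query have already been extracted. The log format, the column names, and the `top_columns` helper are all hypothetical and are not Alation's implementation:

```python
from collections import Counter


def top_columns(query_log, n=3):
    """Return the n most frequently referenced columns in a query log.

    query_log is an iterable of lists; each inner list holds the fully
    qualified columns that one query touched (hypothetical format).
    """
    counts = Counter()
    for columns in query_log:
        # Count a column once per query, even if the query mentions it twice.
        counts.update(set(columns))
    return counts.most_common(n)


# Made-up query log for illustration only.
query_log = [
    ["sales.orders.customer_id", "sales.orders.total"],
    ["sales.orders.customer_id", "sales.customers.region"],
    ["sales.orders.customer_id", "sales.orders.total"],
]

print(top_columns(query_log, n=2))
```

Columns at the top of this ranking would be the first candidates to migrate.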
Lineage - shows impact of all downstream assets; there could be a staging table that is upstream and feeding data; could help understand what really needs to be migrated
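The downstream-impact idea can be sketched as a graph traversal. This is only an illustration under assumed data: the lineage is a plain dictionary mapping each asset to the assets it feeds, and the table and report names are invented:

```python
from collections import deque


def downstream_assets(lineage, source):
    """Return every asset reachable downstream of source, in breadth-first order."""
    seen = {source}  # guards against cycles and keeps the source out of the result
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical lineage: a staging table feeds a warehouse table,
# which in turn feeds two BI assets.
lineage = {
    "staging.orders_raw": ["warehouse.orders"],
    "warehouse.orders": ["bi.revenue_report", "bi.churn_dashboard"],
    "warehouse.customers": ["bi.churn_dashboard"],
}

print(downstream_assets(lineage, "staging.orders_raw"))
```

Running the traversal from an upstream staging table lists everything that would be affected if that table were changed or left out of a migration.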
All assets need to be ingested into the catalog; REST API for lineage
Matching to other columns and aliases; machine learning will look for suggested terms; if they are in the catalog and discoverable, there is a high probability that Alation will make the connection
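To make the term-suggestion idea concrete, here is a rough stand-in that uses plain string similarity from Python's standard library rather than machine learning. It is not how Alation does the matching; the `suggest_term` helper, the glossary entries, and the 0.6 threshold are all assumptions chosen for illustration:

```python
from difflib import SequenceMatcher


def suggest_term(column_name, glossary_terms, threshold=0.6):
    """Suggest the closest glossary term for a column name, or None."""
    # Normalize "cust_id" to "cust id" so it compares better with "Customer ID".
    normalized = column_name.lower().replace("_", " ")
    best_term, best_score = None, 0.0
    for term in glossary_terms:
        score = SequenceMatcher(None, normalized, term.lower()).ratio()
        if score > best_score:
            best_term, best_score = term, score
    # Only suggest a term when the similarity clears the (assumed) threshold.
    return best_term if best_score >= threshold else None


glossary = ["Customer ID", "Order Total", "Region"]  # made-up glossary

print(suggest_term("cust_id", glossary))
print(suggest_term("xyz", glossary))
```

A real catalog would likely also use query history, aliases, and usage signals, which a pure name comparison like this cannot capture.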
Data governance implementation
We enable visual governance: stewards can take ownership in the catalog; some data tables don't need stewards, but a lot of the time you need to attach them; the stewardship dashboard lets you add stewards and review them; you can send messages to data analysts in the platform
Support for semi-structured, unstructured
We can make it a catalog source if it can be described with a little bit of JSON
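As an illustration of "described with a little bit of JSON", the sketch below builds a small descriptor for a semi-structured source. The field names (`name`, `type`, `location`, `fields`) and the bucket path are invented for this example; the actual descriptor format Alation expects is not given in these notes:

```python
import json

# Hypothetical descriptor for a semi-structured source; every key and value
# here is an assumption, not a documented catalog schema.
source_descriptor = {
    "name": "clickstream_events",
    "type": "semi-structured",
    "location": "s3://example-bucket/clickstream/",  # placeholder path
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "event", "type": "string"},
        {"name": "payload", "type": "object"},
    ],
}


def field_names(descriptor):
    """List the field names declared in a descriptor."""
    return [field["name"] for field in descriptor["fields"]]


encoded = json.dumps(source_descriptor, indent=2)  # what would be sent or stored
decoded = json.loads(encoded)                      # round-trip to confirm valid JSON

print(encoded)
print(field_names(decoded))
```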
Follow up questions
How much effort / commitment does it take to get data into Alation?
What percentage of people using the platform are BI analysts, data analysts, and business leaders?
New connectors or new products or new features?
Future - data privacy, data discovery, data sharing, and data acquisition
Reason for building
It was an inventory management system held by IT
Data governance spurred by regulations
New application
Data governance off the ground much quicker than ever before
Automated ingestion and automatic role definition; policies are no longer manually entered and centralized
Multi-cloud governance and security
Differences can diversify IT portfolios
Automate / ML - at the point of usage people know what they are allowed to do; they know policy has been applied to the database
Human brilliance -
Snowflake - roadmap; policies and controls flow from Snowflake into Alation; another plane to manage data inside the enterprise
Federated data governance; data sharing as well - getting up and running quickly because queries may already be built
Single source of reference