Snowflake
Shared Nothing Architecture
Many shared-nothing systems offer only eventual consistency
Older Data warehouses
Shared Disk architecture
Multiple nodes accessing single storage
Single point of failure
Cannot scale out easily
Limited to vertical scaling (bigger machines)
Hadoop
Storage and compute were tightly coupled - HDFS and MapReduce
Spark
In-memory computation engine; runs on top of Hadoop; can read/write to S3
Cloudera Impala
MPP - shared nothing architecture
Does not store data itself - relies on Hive/HDFS for storage
No failover for failed queries (they must be restarted)
Cassandra - Shared nothing architecture
Vertica
MPP - scaling out and scaling up are hard
Snowflake Architecture
Multi-cluster, shared-data architecture - combines the benefits of shared-nothing and shared-disk architectures
Data Storage Layer
Underlying cloud storage (S3/Azure Blob/GCP Bucket); virtually infinite
Compressed and encrypted; with redundancy; pay only for stored data
Compute and Processing Layer
Query engine or virtual layer
Underlying VMs (EC2, Azure/GCP VMs)
Scales up and down (accommodates almost any workload)
Different-size VMs for different workloads
Cloud Service Layer (Brain of Snowflake)
Authentication and authorization
User and session management
Query compilation, optimization, data caching
Virtual Warehouse management
Metadata management
Storage Layer
Snowflake organizes stored data into databases
Tables - permanent, temporary, transient - plus standard or materialized views (example below)
Databases are organized into schemas; support structured (relational) and semi-structured data (JSON, Parquet)
Data loaded into table
Data is converted into Snowflake's proprietary, optimized, compressed columnar format
This gives fast, efficient queries across workloads with lower compute cost
Add encryption
Data loaded to cloud storage layer
Data is only accessible via SQL
Data storage cost is calculated from the average amount of data stored
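A minimal sketch of the table types above in Snowflake DDL - database, schema, table, and column names are made up for illustration:

    -- Permanent table: persisted long-term in the storage layer
    CREATE TABLE sales.public.orders (
        order_id   NUMBER,
        payload    VARIANT,        -- semi-structured data (e.g., JSON)
        created_at TIMESTAMP_NTZ
    );

    -- Transient table: persisted, but with weaker data-protection guarantees (cheaper)
    CREATE TRANSIENT TABLE sales.public.orders_staging LIKE sales.public.orders;

    -- Temporary table: exists only for the current session
    CREATE TEMPORARY TABLE session_scratch (k STRING, v VARIANT);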
Compute Layer
Queries are executed at this layer
A compute cluster needs to be provisioned (a virtual warehouse)
Virtual warehouse has access to storage layer
Creating virtual warehouse
You give it a name and a size, and Snowflake handles provisioning and configuration (example below)
A virtual warehouse can be scaled up and down; the compute size can be doubled without interrupting running queries
Snowflake handles concurrency for virtual warehouses running at same time
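A minimal sketch of provisioning and resizing a warehouse - the warehouse name and settings are illustrative:

    -- Provision a warehouse: just a name and a size
    CREATE WAREHOUSE etl_wh WITH
        WAREHOUSE_SIZE = 'SMALL'
        AUTO_SUSPEND   = 300      -- suspend after 5 minutes idle
        AUTO_RESUME    = TRUE;

    -- Scale up without interrupting queries that are already running
    ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'MEDIUM';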
Cloud Service Layer
Snowflake paper - "The Snowflake Elastic Data Warehouse" (SIGMOD 2016)
Data warehouses differ from OLTP systems - OLAP scans large amounts of data per query; OLTP deals with far less data per transaction
Oracle data warehouse, SQL server, Vertica, Netezza, Teradata
Late 2000s: the cloud - data warehouses on the cloud; multi-tenant databases; transactional
Security baked in at beginning
Elastic focus
Everyone who wanted to build a database said: let's take Postgres, rip out the storage layer, and go from there
Snowflake decided to build it from scratch; it is not based on any existing SQL database
Runs on AWS and Azure
100m queries per day
Ashish came from Oracle; roughly 70% of Oracle's code was the storage layer and everything around it, 30% the query optimizer
The cloud was that 70% - it was basically already written for us; we had effectively unlimited compute (both to demand and to relinquish)
Oracle and other databases were built and sold as boxes; a quarter rack takes about 3 months of planning and 3 months to move the data before you can work with it
ETL
Batch jobs transferred data from OLTP to OLAP systems, typically running every night; people started generating a lot more data and wanted all of it in their warehouse
Data was coming from sources like Salesforce, often unstructured or semi-structured
Google and Facebook built their own systems to handle this data
Hive, Spark, BigQuery, Impala, Blink
They weren't really built with security in mind
We chose not to use any Hadoop primitives; we questioned everything Hadoop was doing
This was 2012 - a cloud database wasn't a given at that time; you could pick from a dozen Hadoop-based solutions, and only one company wanted to build it from scratch
Vision
Data warehouse as a service; SQL front end
Multi-dimensional elasticity
Support for multiple data types
Shared-Nothing Architecture (as compared to Shared-Disk architecture)
People liked it, but it came with a huge set of problems around local storage
Joins can be done locally; but as soon as compute and storage are colocated, any time you run out of storage you have to redistribute data
Perils of Coupling
There was a lot of competition for resources; we wanted to separate compute and storage
Multi-cluster, Shared-Data Architecture
Data sits in the center, on a service like S3; incoming data (including JSON) is converted into FDN files ("flocon de neige" - snowflake in French)
Provision a virtual warehouse - run queries through the warehouse - data brought there for the purpose of the query; only paying for the compute and a small amount for the storage
You can have different clusters for different types of workloads - ETL used to be restricted by compute power as well
Separate compute for your QA and testing environments
Life of a query
There is a large metadata component here
Data is stored in blob storage as a set of immutable files; everything else - catalog, accounts, users, sessions, billing, SQL statements - is stored in metadata, a separate system available to all services in the warehouse
Allows us to efficiently do a clone
No data silos; unlimited availability
Gets close to shared-nothing cache locality - because files are consistently hashed to workers, each worker tends to cache the same files across queries (see Worker Nodes below)
Architecture
Log in to a Snowflake account and provision a warehouse; execute a query - it is sent to the optimizer, which creates a plan; the plan goes to the virtual warehouse, which fetches it, executes it, and writes results to S3
Workers are stateless - they get a set of files they are told to read and cache them locally; when a query is done they ask whether there is another query to run
Data Storage Layer
Amazon S3; tables are built of immutable files - immutability makes them easier to cache
Transaction snapshot isolation is also much easier with immutable files
S3 is highly available; its performance is not great, but local caching helps
Table Files
PAX - hybrid columnar storage (there is a paper on it)
Each column is stored separately, with offsets in the file; various types of compression are applied per column - specialized encodings for numbers, dictionary-style coding for strings, and a fancy encoding for JSON as well
16 MB in size
The header points to each column, so only part of a file needs to be fetched
Other data
Query results are stored on S3 - if I did SELECT * on a 1-billion-row table, most databases would break; we push those results to S3 and return them to the client
You can then query your own results (example below)
Enabled by metadata and storage
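A sketch of querying your own results via Snowflake's RESULT_SCAN, assuming a prior query in the same session - the TPC-H-style table and column names are illustrative:

    -- Some large query whose result set is persisted by the service
    SELECT c_custkey, SUM(o_totalprice) AS spend
    FROM orders JOIN customer ON o_custkey = c_custkey
    GROUP BY c_custkey;

    -- Query the result of the previous statement
    SELECT *
    FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
    WHERE spend > 10000;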
Metadata is stored in a transactional key-value store; it handles transactions at a global level, lock tables, and optimizer stats
The metadata store is FoundationDB
Virtual Warehouse
Hides what the underlying resources really are
Worker Nodes
A query comes in; from metadata the compiler determines which micro-partitions belong to each table, prunes using zone maps to remove micro-partitions, then consistent-hashes the remaining micro-partitions onto the nodes of a virtual warehouse
Expect the caches on the warehouse to work effectively
Resources in the cloud are somewhat flaky because multiple workloads run on the same machine
Need to deal with stragglers - 8 out of 9 pipelines may be done with one still lagging
Work stealing - handled via consistent hashing as well; a slow node's files get picked up through another round of consistent hashing, so even a consistently slow node doesn't block the query
Execution Engine
Columnar - in the spirit of MonetDB
Vectorized - following the pioneers of database vectorization; Snowflake had to be vectorized; uses selection vectors (don't materialize intermediate results)
Push-based model - instead of Volcano-style, iterator-based engines like Oracle; it became clear that most new databases should be push-based; Neumann's paper is a good reference for push-based engines
Operates on batches of data; the plan flows from the bottom up; plans can be DAGs, not just trees. Example - an update: modifying even a single row of a table means reading the entire micro-partition and writing everything back; so the plan is split into 2 pipelines - data that needs to be updated in one, data that doesn't in the other; DAG-based plans made this possible (impossible without the push-based approach)
Splitting what does and doesn't need to be modified allows more speed
No buffer pool - decided that reserving memory for caching micro-partitions was not very valuable
Self-tuning and Self-Healing
Auto memory management - the compiler decides how much memory each operator gets; operators try to use as much memory as possible at the beginning
Take as many resources as possible to finish a query
Operators can grow and shrink their memory
Parallelism chosen based on what we see
At run-time make adaptive decisions because query optimizer could be wrong
Join order is set by planner
Automatic Skew Avoidance
We gather data and see what is popular; we are always on in terms of being adaptive
Cloud Service
We have high availability; the cloud service turned out to be harder to build than the database
Cloud service needs to be online 24/7 and online upgrades - people dancing on the dance floor and you need to upgrade the dance floor every week and the dancers can't know that you are doing it
Concurrency Control
Transactions done via metadata
Implement snapshot isolation
Immutable S3 files make life easier - just keep track of the set of files that belong to a table (each DML operation adds or removes files)
Rather than keeping an undo log, just keep track of a set of file operations
Allows time travel - you know what the files were one day ago, 30 days ago, etc.; all you incur is extra storage cost
People use it to see changes in a table
Use it for making copies of data - e.g., clone a table as of 7 days ago (example below)
Don't version schema the same way though
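A minimal sketch of time travel using Snowflake's AT clause - the table name and offsets are illustrative:

    -- Table as it was 24 hours ago
    SELECT * FROM orders AT (OFFSET => -60*60*24);

    -- Rows that changed in the last day: current state minus yesterday's snapshot
    SELECT * FROM orders
    MINUS
    SELECT * FROM orders AT (OFFSET => -60*60*24);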
Cloning
Cloning an object is cheap - only the metadata is copied (example below)
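A sketch of cloning, including a clone combined with time travel - object names are illustrative:

    -- Zero-copy clone: only metadata is duplicated, no data files are copied
    CREATE TABLE orders_dev CLONE orders;

    -- Clone the table as of 7 days ago
    CREATE TABLE orders_last_week CLONE orders AT (OFFSET => -60*60*24*7);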
Pruning
Key contribution to the product - zone maps are part of the metadata
A number of queries can be answered from metadata alone
The compiler looks at zone maps
We implemented this across a lot of operators - almost every operator works with zone maps - min, max, and number of distinct values (sketch below)
Zone maps are kept in memory - they were originally in FoundationDB but are now in S3
Only the brain (cloud services) is written in Java - metadata, compiler, access control, security, transactions; the muscle is the virtual warehouses executing the plan
The muscle is in C++
To build a RESTful service there are many more libraries available in Java, though there are cloud services written only in C++; if we redid this today, we would not use Java
The bulk of query time is spent in the C++ code; whether a query takes 1 minute or 2-3 seconds, that work is done by the C++ layer; the Java layer should account for only hundreds of milliseconds
The number of zone maps is determined up front - we fix the micro-partition size, and each micro-partition gets one
Maintained some Bloom filters, but they got too large over time
No indexes at all - not uncommon at this scale: with a billion rows you can have an index; with a trillion rows it becomes a lot harder to deal with
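A hedged sketch of the kinds of queries zone maps help with - the first can typically be answered from metadata alone, the second is pruned down to micro-partitions whose min/max range overlaps the filter; table and column names are made up:

    -- Can often be answered from zone maps / metadata, without scanning data files
    SELECT MIN(event_ts), MAX(event_ts), COUNT(*) FROM events;

    -- Pruning: micro-partitions whose [min, max] range for event_ts
    -- doesn't overlap the predicate are skipped entirely
    SELECT user_id, COUNT(*)
    FROM events
    WHERE event_ts >= '2016-06-01' AND event_ts < '2016-06-02'
    GROUP BY user_id;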
SaaS experience
Built a web UI; much easier to do things in UI - billing, access control, utilization
We did a lot of sampling - got metrics on operators, queries; people from support can help customers with this data
Support MySQL, Netezza, and Teradata dialects for easy porting; wrote our own SQL parser using ANTLR; the grammar is a Frankenstein grammar (a bit of everything)
Continuous Availability
The points of failure are external services - in US West there are ~1,000 brains and ~5,000 multi-processor workers - a failure of any single one shouldn't bring the service down
Semi-Structured data
Realized we needed to support it
Huge differentiator - we have really good JSON support - the VARIANT type holds JSON
Can do joins and group-bys on it; FLATTEN it - turn nested data into individual rows (example below)
Schema-Less
BigQuery/Dremel and Impala needed a schema upfront
Snowflake can do automatic type inference and columnar storage for schema-less data
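A minimal sketch of VARIANT plus FLATTEN - table and field names are made up:

    CREATE TABLE events (v VARIANT);   -- raw JSON lands here, no schema upfront

    -- Path access with type casts, plus FLATTEN to turn a nested array into rows
    SELECT
        v:user.id::STRING      AS user_id,
        item.value:sku::STRING AS sku,
        COUNT(*)               AS n
    FROM events,
         LATERAL FLATTEN(input => v:items) item
    GROUP BY 1, 2;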
Security
Encrypt data at rest; had to prove to customers that we'd be way more secure than they would ever be on their own
Fine-grained access can be granted to users (sketch below)
Key management is hierarchical; we can't read the data without the customer's key
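A hedged sketch of fine-grained access control via roles - the database, role, and user names are made up:

    -- Grant a role read access to one schema only
    CREATE ROLE analyst;
    GRANT USAGE ON DATABASE sales TO ROLE analyst;
    GRANT USAGE ON SCHEMA sales.public TO ROLE analyst;
    GRANT SELECT ON ALL TABLES IN SCHEMA sales.public TO ROLE analyst;

    -- Assign the role to a user
    GRANT ROLE analyst TO USER alice;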
Post Sigmod 16 features
Data sharing - builds on the metadata and access controls, and it just works
Ingestion service (Snowpipe) - doesn't require provisioning compute upfront - put data in S3 and we will just ingest it (sketch below)
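A hedged sketch of both features - a share that exposes a table to another account, and a pipe that auto-ingests staged files; all object and account names are made up:

    -- Data sharing: expose a table to a consumer account via metadata/access control
    CREATE SHARE sales_share;
    GRANT USAGE ON DATABASE sales TO SHARE sales_share;
    GRANT USAGE ON SCHEMA sales.public TO SHARE sales_share;
    GRANT SELECT ON TABLE sales.public.orders TO SHARE sales_share;
    ALTER SHARE sales_share ADD ACCOUNTS = partner_account;

    -- Ingestion without provisioning a warehouse: files landing in the stage get loaded
    CREATE PIPE sales.public.orders_pipe AUTO_INGEST = TRUE AS
        COPY INTO sales.public.orders FROM @sales.public.s3_stage;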
Lessons learned
Controversial to build a Database that wasn't hadoop based in 2012
Users love no-tuning aspect
Core performance is not as important as ease of use and easy ingestion
Challenges
Multi-tenancy, multi-cloud, replication, data lake support, materialized views
Performance
Q: Anything that you regret doing? Engineering time sinks?
Cloud Services has been super painful - query compilation does a fair amount of work and takes a lot of resources; provisioning and isolating those resources is much harder
Really hard to tell how much memory each query uses
The cloud service is multi-tenant, and Java doesn't give much insight into per-query memory usage