Snowflake

  • Shared Nothing Architecture

    • Offers only eventual consistency

  • Older Data warehouses

    • Shared Disk architecture

      • Multiple nodes accessing single storage

      • Single point of failure

      • Cannot scale easily

      • Vertical scaling

  • Hadoop

    • Storage and compute were tightly coupled - HDFS and MapReduce

  • Spark

    • In-memory computation engine; runs on top of Hadoop; can read/write to S3

  • Cloudera Impala

    • MPP - shared nothing architecture

    • It does not store the data itself - uses Hive/HDFS for storage

    • No failover for queries that fail (need to restart)

  • Cassandra - Shared nothing architecture

  • Vertica

    • MPP - scaling out and scaling up are hard

  • Snowflake Architecture

    • Multi-cluster shared data architecture - combines shared nothing and shared disk architecture

    • Data Storage Layer

      • Underlying cloud storage (S3/Azure Blob/GCP Bucket); virtually infinite

      • Compressed and encrypted; with redundancy; pay only for stored data

    • Compute and Processing Layer

      • Query engine / virtual warehouse layer

      • Underlying VMs (EC2, Azure/GCP VMs)

      • Scales up and down (accommodates almost any workload)

      • Different sized VMs for different workloads

    • Cloud Service Layer (Brain of Snowflake)

      • Authentication and authorization

      • User and session management

      • Query compilation, optimization, data caching

      • Virtual Warehouse management

      • Metadata management

  • Storage Layer

    • Snowflake stores data in databases

    • Tables - permanent, temporary, transient; views - standard or materialized

    • Databases are organized into schemas; data can be structured (SQL) or semi-structured (JSON, Parquet)

    • Data loaded into a table (see the sketch after this list)

      • Data is converted into Snowflake's optimized, compressed, columnar proprietary format

        • This gives fast, efficient workloads with lower compute usage

        • Add encryption

        • Data loaded to cloud storage layer

        • Data is only accessible via SQL

        • Data storage cost is calculated on the daily average amount of compressed data stored
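
    • A minimal sketch of the table types and a load in Snowflake SQL; the table, stage, and column names are hypothetical:

      ```sql
      -- Permanent table: standard storage with full Time Travel and fail-safe
      CREATE TABLE orders (id NUMBER, amount NUMBER, created_at TIMESTAMP);

      -- Transient table: persists until dropped, but has no fail-safe period
      CREATE TRANSIENT TABLE staging_orders (raw VARIANT);

      -- Temporary table: exists only for the current session
      CREATE TEMPORARY TABLE tmp_orders AS
        SELECT * FROM orders WHERE amount > 100;

      -- Loading converts the files into Snowflake's compressed, encrypted,
      -- columnar format in the cloud storage layer
      COPY INTO orders
      FROM @my_stage/orders/
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
      ```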

  • Compute Layer

    • Queries are executed at this layer

    • A compute cluster needs to be provisioned (a virtual warehouse)

    • Virtual warehouse has access to storage layer

    • Creating virtual warehouse

      • You give a name and size and Snowflake handles provisioning and configuration (sketch below)

      • A virtual warehouse can be scaled up and down; compute can be doubled without interrupting a running query

      • Snowflake handles concurrency for virtual warehouses running at same time
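
      • A minimal sketch of provisioning and resizing a virtual warehouse; the warehouse name and sizes are illustrative:

        ```sql
        -- You supply a name and a size; Snowflake provisions and configures the VMs
        CREATE WAREHOUSE etl_wh
          WAREHOUSE_SIZE = 'MEDIUM'
          AUTO_SUSPEND   = 300      -- seconds of inactivity before suspending
          AUTO_RESUME    = TRUE;

        -- Scale up without interrupting queries already running on the warehouse
        ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE';
        ```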

  • Cloud Service Layer


  • Snowflake paper - "The Snowflake Elastic Data Warehouse" (SIGMOD 2016)

  • Data warehouses (OLAP) are different from OLTP - OLAP scans large amounts of data; OLTP deals with a lot less data per query

  • Oracle data warehouse, SQL Server, Vertica, Netezza, Teradata

  • Late 2000s: the cloud - data warehouse on the cloud; multi-tenant database; transactional

    • Security baked in at beginning

    • Elastic focus

  • Everyone who wanted to build a database said: let's take Postgres, rip out the storage layer, and go from there

  • Snowflake decided to build it from scratch; it is not built on any existing SQL database

    • Runs on AWS and Azure

    • 100m queries per day

  • Ashish came from Oracle; 70% of Oracle's code was in the storage layer and everything around it; 30% in the query optimizer

  • The cloud was that 70% - it was basically already written for us; we had unlimited compute (acquired and relinquished on demand)

  • Oracle and other databases were built and sold as boxes; you buy a quarter rack and it takes 3 months to get it working, 3 months of planning, 3 months to move data

  • ETL

    • Batch jobs transferred some data from OLTP to OLAP systems, running every night; people started generating a lot more data and wanted all of it in their database

    • Data was coming from Salesforce and other unstructured or semi-structured sources

  • Google, Facebook - built their own systems to handle their data

    • Hive, Spark, BigQuery, Impala, Blink

    • They weren't really built for security

  • We chose not to use any Hadoop primitives; we questioned everything that Hadoop was doing

  • This was 2012 - a cloud database wasn't a given at that time; you could pick from a dozen Hadoop-based solutions, and only one company wanted to do it from scratch

  • Vision

    • Data warehouse as a service; SQL front end

    • Multi-dimensional elasticity

    • Support for multiple data types

  • Shared-Nothing Architecture (as compared to Shared-Disk architecture)

    • People liked them, but they came with a huge set of problems around local storage

      • Joins can be done locally, but as soon as you colocate compute and storage, any time you run out of storage you have to redistribute data

  • Perils of Coupling

    • There was a lot of competition for resources; we wanted to separate compute and storage

  • Multi-cluster, Shared-Data Architecture

    • Data sits at the center - on a service like S3; incoming data (e.g. JSON) is turned into FDN files ("flocon de neige", snowflake in French)

    • Provision a virtual warehouse and run queries through it - data is brought there for the purpose of the query; you pay only for the compute plus a small amount for the storage

    • You can have different clusters for different types of workloads - ETL used to be restricted by compute power as well (sketch below)

    • Separate data for your QA and testing environments
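
    • A sketch of "same data, different compute" in Snowflake SQL; the warehouse names and the orders table are hypothetical:

      ```sql
      -- Two independent clusters over the same shared data
      CREATE WAREHOUSE etl_wh WAREHOUSE_SIZE = 'XLARGE';  -- heavy ETL / batch loads
      CREATE WAREHOUSE bi_wh  WAREHOUSE_SIZE = 'SMALL';   -- interactive dashboards

      -- Queries see the same tables; only the compute differs
      USE WAREHOUSE bi_wh;
      SELECT DATE_TRUNC('day', created_at) AS day, SUM(amount)
      FROM orders
      GROUP BY 1;
      ```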

  • Life of a query

    • There is a large metadata component here

    • Data is stored in blob storage as a set of immutable files; everything else - catalog, accounts, users, sessions, billing, SQL statements - is stored in metadata, a separate system available to all services in the warehouse

    • Allows us to efficiently do a clone

    • No data silos; unlimited availability

    • Get close to shared nothing cache locality - why is this? Is it because different users are pulling data and caching it

  • Architecture

    • Snowflake account - log in and provision a warehouse; execute a query - it is sent to the optimizer, which creates a plan; the plan goes to the virtual warehouse, where workers fetch it and execute it (reading from and writing to S3)

    • Workers are stateless - get a set of files they are told to read, cache them locally; when that query is done they ask if there is another query to be done

  • Data Storage Layer

    • Amazon S3; built of immutable files - which makes them easier to cache

    • Transactional snapshot isolation was also much easier

    • S3 is highly available; performance is not great, but local caching helps

  • Table Files

    • PAX - hybrid columnar storage (there is a paper on it)

    • Each column is stored separately, with offsets recorded within the file; various types of compression are applied per column - e.g. dictionary coding for strings, with suitable encodings for numbers and JSON

    • 16 MB in size

    • A header points to each column - so only part of a file needs to be fetched

  • Other data

    • Query results are stored on S3 - if you did SELECT * on a 1-billion-row table, most databases would break; we push those results to S3 and return them to the client

    • You can then query your own results (sketch below)

    • Enabled by metadata and storage

    • Metadata is stored in a transactional key-value store; it handles transactions at a global level, lock tables, and optimizer stats

    • The metadata store is FoundationDB
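
    • A sketch of querying your own result set via RESULT_SCAN; the orders table is hypothetical:

      ```sql
      -- A large query; the result set is persisted (on S3) rather than held by the client
      SELECT * FROM orders;

      -- Re-query the previous result without touching the base table
      SELECT COUNT(*), MAX(amount)
      FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
      ```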

  • Virtual Warehouse

    • Hides what the underlying resources really are

  • Worker Nodes

    • A query comes in; the compiler determines from metadata which micro-partitions belong to the table, prunes them using zone maps, and consistent-hashes the remaining micro-partitions onto the nodes of the virtual warehouse

    • Expect caches on warehouse to work effectively

    • Resources in the cloud are somewhat flaky because you are running multiple workloads on the same machine

    • Need to deal with stragglers - 8 out of 9 pipelines may be done but one is not

    • Work stealing - handled with another round of consistent hashing; if one specific node is consistently slow, its files are reassigned to nodes that still work

  • Execution Engine

    • Columnar - MonetDB

    • Vectorized - pioneer of vectorization for databases; Snowflake had to be vectorized; selection vectors (don't materialize intermediate results)

    • Push-based model - instead of Volcano-style, iterator-based execution like Oracle's; it became clear that most databases should be written push-based; the Neumann paper is a good reference for push-based databases

    • Operates on batches of data; the plan executes bottom-up, with data pushed from the leaves; plans can be DAGs and not just trees; for updates - modifying a single row of a table means reading the entire micro-partition and writing everything back - the plan splits into 2 pipelines: data that needs to be updated in one, data that doesn't in the other; DAG-based plans were super helpful here (impossible without the push-based approach)

    • Basically things that do and don't need to be modified can be split apart - allowing more speed

    • No buffer pool - decided that reserving memory for micro-partitions was not super valuable;

  • Self-tuning and Self-Healing

    • Auto memory management - compiler decides how much memory it should get; operator tries to use as much memory as possible in the beginning

      • Take as many resources as possible to finish the query

      • Operators can grow and shrink their memory

    • Parallelism chosen based on what we see

      • At run-time make adaptive decisions because query optimizer could be wrong

    • Join order is set by planner

  • Automatic Skew Avoidance

    • We gather data and see what is popular; we are always on in terms of being adaptive

  • Cloud Service

    • We have high availability; the cloud services layer turns out to be harder to build than the database

    • Cloud service needs to be online 24/7 and online upgrades - people dancing on the dance floor and you need to upgrade the dance floor every week and the dancers can't know that you are doing it

  • Concurrency Control

    • Transactions done via metadata

    • Implement snapshot isolation

    • Immutable S3 files make life easier - just keep track of the set of files that belong to a table (updated by each DML operation)

    • Rather than have an undo log - just keep track of a set of file operations

    • This allows us to do time travel - we know what the files were one day ago, 30 days ago, etc.; all you incur is extra storage cost (sketch below)

      • People use it to see changes in a table

      • Use it for making copies of data - I want to make a clone of data as of 7 days ago

      • Don't version schema the same way though
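
      • A sketch of time travel queries (offsets are in seconds; the orders table is hypothetical):

        ```sql
        -- Table contents as of one day ago
        SELECT * FROM orders AT (OFFSET => -60*60*24);

        -- Rows present now but not a day ago: current state minus the old snapshot
        SELECT * FROM orders
        MINUS
        SELECT * FROM orders AT (OFFSET => -60*60*24);
        ```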

  • Cloning

    • Cloning an object is easy - just make a copy of the metadata (sketch below)
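
    • A sketch of a zero-copy clone as of 7 days ago; the table names are hypothetical:

      ```sql
      -- Only metadata is copied; the underlying files are shared until either table changes
      CREATE TABLE orders_last_week CLONE orders AT (OFFSET => -60*60*24*7);
      ```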

  • Pruning

    • Key contribution to product - zone maps are part of metadata

    • A number of queries can be answered from metadata alone (see the sketch below)

    • Compiler looks at zone maps

    • We implemented a lot of operators - almost every operator works with zone map - min and max and distinct values

    • Zone maps are kept in memory - they were originally stored in FoundationDB but are now in S3

    • Only the brain (cloud services) is written in Java - metadata, compiler, access control, security, transactions; the muscle is the virtual warehouses executing the plan

    • Muscle in C++

    • To build a RESTful service there are a lot more libraries available for Java; there are now enough cloud services written in C++ only; if we redid this today, we would not use Java

    • The bulk of query time is spent in C++ code; whether a query takes 1 minute or 2-3 seconds, all the work is done by the C++ layer; the Java layer is hopefully only hundreds of milliseconds

    • Number of zones is pre-determined - we determine size of partition and then we have that

    • Maintained some Bloom filters, but they got too large over time

    • No indexes at all - not uncommon at this scale - with a billion rows you can have an index; with one trillion rows it becomes a lot harder to deal with
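
    • A sketch of how pruning and metadata-only queries look from the user's side; the orders table is hypothetical:

      ```sql
      -- The compiler compares the range predicate against each micro-partition's
      -- min/max zone maps in metadata and skips partitions that cannot match
      SELECT SUM(amount)
      FROM orders
      WHERE created_at >= '2016-01-01' AND created_at < '2016-02-01';

      -- Some queries can be answered from metadata alone, without scanning data
      SELECT COUNT(*), MIN(created_at), MAX(created_at) FROM orders;
      ```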

  • SaaS experience

    • Built a web UI; much easier to do things in UI - billing, access control, utilization

    • We did a lot of sampling - got metrics on operators, queries; people from support can help customers with this data

    • Support MySQL, Netezza, Teradata dialects for easy porting; wrote our own SQL parser - used ANTLR; our grammar was a Frankenstein grammar (a bit of everything)

  • Continuous Availability

    • The points of failure are external services - US West (1,000 brains, 5,000 workers with multi-processors) - a failure of any single one shouldn't bring the system down

  • Semi-Structured data

    • Realized we needed to support it

    • Huge differentiator - we have really good JSON support - the VARIANT type holds JSON

    • Can do joins and group-bys on it; FLATTEN turns nested data into individual rows (sketch below)
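
    • A sketch of VARIANT and FLATTEN, assuming a hypothetical events table loaded from JSON:

      ```sql
      CREATE TABLE events (payload VARIANT);

      -- Path access with a cast, plus a normal GROUP BY on semi-structured data
      SELECT payload:user.country::STRING AS country, COUNT(*) AS n
      FROM events
      GROUP BY 1;

      -- FLATTEN turns a nested array into individual rows
      SELECT e.payload:user.id::NUMBER AS user_id,
             f.value:sku::STRING       AS sku
      FROM events e,
           LATERAL FLATTEN(input => e.payload:items) f;
      ```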

  • Schema-Less

    • BigQuery/Dremel and Impala needed a schema upfront

    • Snowflake can do automatic type inference and columnar storage for schema-less data

  • Security

    • Data is encrypted at rest; we had to prove to customers that we'd be way more secure than they'd ever be

    • Fine-grained access can be granted to users

    • Key management is hierarchical; we can't read the data without the customer's key

  • Post Sigmod 16 features

    • Data sharing - uses metadata and access controls and it just sort of works

    • Ingestion service (Snowpipe) - doesn't require provisioning compute upfront - put data in S3 and we will just ingest it (sketch of both below)
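
    • A sketch of both features in Snowflake SQL; the share, database, stage, and account names are hypothetical:

      ```sql
      -- Data sharing: a share is metadata plus access grants - no data is copied
      CREATE SHARE sales_share;
      GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
      GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
      GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;
      ALTER SHARE sales_share ADD ACCOUNTS = partner_account;

      -- Continuous ingestion (Snowpipe): no user-managed warehouse is provisioned;
      -- files landing on the stage are ingested as they arrive
      CREATE PIPE events_pipe AUTO_INGEST = TRUE AS
        COPY INTO events FROM @my_stage/events/ FILE_FORMAT = (TYPE = 'JSON');
      ```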

  • Lessons learned

    • It was controversial to build a database that wasn't Hadoop-based in 2012

    • Users love no-tuning aspect

    • Core performance not as important as ease of use, easy ingestion

  • Challenges

    • Multi-tenancy, multi-cloud, replication, data lake support, materialized views

    • Performance

  • Q: Anything that you regret doing? Engineering time sinks?

    • Cloud services have been super painful - compiling queries does a fair amount of work and takes a lot of resources; provisioning and isolating those resources is a lot harder

    • Really hard to tell how much memory is needed

    • The cloud service is multi-tenant - Java doesn't give a lot of insight into per-query memory usage