Snowflake
Snowflake is a fully managed data platform that builds on cloud vendor storage and compute in AWS, Azure, and Google Cloud. It lets you create a data warehouse or data lake and process and share data. Its core services include virtual warehouses and Snowpipe, alongside tooling: a web portal, a CLI called SnowSQL, various connectors and drivers, and integrations with BI solutions like Salesforce and Tableau. The main sections of the web portal:
- Databases
- Shares
- Data Marketplace
- Warehouses
- Worksheets
Billing
Billing has three components:
- Storage: flat rate per terabyte, with additional charges for continuous data protection (time travel and fail-safe)
- Compute: based on the number and size of virtual warehouses, plus serverless features such as Snowpipe, background table maintenance, database replication, and cloud services like auth, infrastructure, and query parsing/optimization
- Data transfer: egress charges, as is typical for cloud providers
Standard Edition
- SQL data warehouse engine
- Encryption in transit and at rest
- All system tools
- 1 day of time travel
Enterprise Edition
Standard plus:
- Materialized views
- Data masking
- Point-lookup optimization
- Multi-cluster warehouses
- Periodic rekey of encryption
- Up to 90 days of time travel
Business Critical
Enterprise plus:
- HIPAA/PCI support
- Customer-managed keys
- AWS and Azure private links within VPCs
- Disaster recovery failover and failback
Virtual Private Snowflake
Business Critical on dedicated hardware; the most secure and most isolated option.
- Snowflake architecture
- Massively parallel processing database engine
- Cluster of nodes that work together to solve a query
- Decoupled compute and storage
- Clusters of compute nodes are separate from storage, so each can scale independently, unlike traditional RDBMS solutions
- Integrated with cloud storage
- Reads from cloud object storage like AWS S3, Google Cloud Storage, and Azure Blob Storage
- How does this differ from Parquet?
- Can query files directly vs. importing data into the native Snowflake table format
- Used for ad-hoc analysis
- Can also define an External Table: a schema over a set of files; Snowflake stores the file list, which may need to be refreshed
- Direct file queries and external table queries can both be wrapped in views
- Not all functions are available in this mode (sketch below)
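A sketch of that workflow; the bucket URL, stage, table, and column names are all invented for illustration:

```sql
-- External stage over an S3 prefix holding Parquet files (hypothetical)
CREATE OR REPLACE STAGE raw_events_stage
  URL = 's3://my-bucket/events/'
  FILE_FORMAT = (TYPE = PARQUET);

-- Ad-hoc analysis: query the files directly; $1 is the row as a VARIANT
SELECT $1:event_id, $1:event_ts
FROM @raw_events_stage
LIMIT 10;

-- External table: a schema over the same files; Snowflake keeps the
-- file list, which is why a REFRESH may be needed
CREATE OR REPLACE EXTERNAL TABLE events_ext (
  event_id VARCHAR AS (VALUE:event_id::VARCHAR),
  event_ts TIMESTAMP AS (VALUE:event_ts::TIMESTAMP)
)
LOCATION = @raw_events_stage
FILE_FORMAT = (TYPE = PARQUET);

ALTER EXTERNAL TABLE events_ext REFRESH;

-- Either style can sit behind a view
CREATE OR REPLACE VIEW recent_events AS
  SELECT * FROM events_ext;
```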
- Virtual warehouse
- Main compute resources that execute queries and commands (selects, inserts, updates, deletes, merges, and copies)
- DDL does not always require a running virtual warehouse
- Shuts off automatically after a configurable period of inactivity
- Scales as needed
- Cost-efficiency of elasticity
- Supports multiple virtual warehouses so they don’t have to compete for resources on the same datasets (different parts of the company can scale as needed)
- Seems all about independence/decoupling of multiple layers of storage and compute (warehouse sketch below)
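A minimal warehouse sketch showing the elasticity knobs; the name and settings are invented:

```sql
-- Warehouse that suspends after 60s idle and resumes on the next query
CREATE OR REPLACE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

USE WAREHOUSE reporting_wh;

-- Compute scales independently of storage: resize on demand
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE';
```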
- Databases -> schemas -> tables, like PostgreSQL
- Tables
- Micro-partitions
- Automatic partitioning scheme
- 50-500MB of uncompressed data each
- Compressed columnar storage
- Pruned during query execution
- Clustering key (optional)
- Aimed at 1TB+ tables
- Orders micro-partition records based on key
- Automatically maintained
- Useful for range and equality predicates on the key (sketch below)
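A clustering-key sketch on a hypothetical fact table; SYSTEM$CLUSTERING_INFORMATION reports how well micro-partitions align with the key:

```sql
-- Large fact table clustered on the column most queries filter by
CREATE OR REPLACE TABLE orders (
  order_id NUMBER,
  order_date DATE,
  customer_id NUMBER,
  amount NUMBER(12,2)
)
CLUSTER BY (order_date);

-- Inspect clustering quality for the key
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');
```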
- Indexes? There are no traditional indexes; the closest analog is the search optimization service
- Search optimization service
- Serverless feature in Enterprise+
- Table-level property
- Applies to equality predicates
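Enabling it is a one-line table property; this reuses the hypothetical orders table from above:

```sql
-- Speeds up point lookups (equality predicates); Enterprise+ only
ALTER TABLE orders ADD SEARCH OPTIMIZATION;

-- The SEARCH_OPTIMIZATION column in the output reflects the property
SHOW TABLES LIKE 'orders';
```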
- Constraints
- Primary key
- Unique key
- Foreign key
- Not null
- Only NOT NULL is enforced; the others are kept as metadata for documentation and tooling (example below)
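A small demonstration with an invented table; the second insert succeeds because the primary key is not enforced:

```sql
CREATE OR REPLACE TABLE customers (
  customer_id NUMBER PRIMARY KEY,  -- metadata only, not enforced
  email VARCHAR UNIQUE,            -- metadata only, not enforced
  created_at TIMESTAMP NOT NULL    -- enforced
);

-- Succeeds despite the repeated "primary key" value of 1
INSERT INTO customers
SELECT 1, 'a@example.com', CURRENT_TIMESTAMP()
UNION ALL
SELECT 1, 'b@example.com', CURRENT_TIMESTAMP();
```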
- Data types
- Booleans
- Numeric (number, integer, float)
- String/Binary
- Char
- Varchar
- Date/Time
- Semi-structured
- Variant
- Array
- Object (key-value pairs)
- Geospatial types for geographical data
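A sketch of the semi-structured types in use; the table and JSON shape are invented:

```sql
CREATE OR REPLACE TABLE events (
  id NUMBER,
  payload VARIANT,  -- arbitrary JSON-like value
  tags ARRAY,
  attrs OBJECT      -- key-value pairs
);

INSERT INTO events
SELECT 1,
       PARSE_JSON('{"user": {"id": 42}, "type": "click"}'),
       ARRAY_CONSTRUCT('web', 'prod'),
       OBJECT_CONSTRUCT('region', 'us-east-1');

-- Path notation plus a cast pulls typed values out of a VARIANT
SELECT payload:user.id::NUMBER AS user_id,
       payload:type::VARCHAR  AS event_type,
       tags[0]::VARCHAR       AS first_tag
FROM events;
```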
- Views
- Normal
- Secure: hides the view definition and blocks optimizations that would expand the view in ways that could expose protected underlying data
- Materialized: Snowflake stores the results and keeps them up to date automatically (examples below)
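Sketches of all three against the hypothetical tables from earlier; materialized views are Enterprise+ and only allow a restricted query shape (single table, limited functions):

```sql
CREATE OR REPLACE VIEW order_amounts AS
  SELECT order_id, amount FROM orders;

-- Secure view: definition hidden, optimizer expansion restricted
CREATE OR REPLACE SECURE VIEW customer_emails AS
  SELECT customer_id, email FROM customers;

-- Materialized view: results stored and maintained by Snowflake
CREATE OR REPLACE MATERIALIZED VIEW daily_totals AS
  SELECT order_date, SUM(amount) AS total
  FROM orders
  GROUP BY order_date;
```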
- Code modules
- User-defined functions
- External functions: functions defined in a DB schema that call remote APIs
- Stored procedures
- Can be coded in SQL or JavaScript
- SELECT ... AT (TIMESTAMP => ...) for time travel
- INSERT from a SELECT against a past state to restore it
- Without a secure view, user-supplied predicates can be evaluated against rows outside the view's filters, leaking protected fields (hence secure views)
- Profiling shows expensive operations (nodes) in queries
- CREATE OR REPLACE FUNCTION ... RETURNS TABLE(...) AS $$ ... $$; defines a SQL table UDF
- SELECT GET_DDL('<object_type>', '<object_name>'); shows the CREATE statement for an object (fuller sketch below)
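A fuller sketch tying these together; names are hypothetical, and the restore pattern materializes the past state into a new table:

```sql
-- SQL table UDF
CREATE OR REPLACE FUNCTION orders_for_customer(cid NUMBER)
  RETURNS TABLE (order_id NUMBER, amount NUMBER(12,2))
  AS $$
    SELECT order_id, amount FROM orders WHERE customer_id = cid
  $$;

SELECT * FROM TABLE(orders_for_customer(42));

-- Time travel: read the table as it was an hour ago
SELECT * FROM orders AT (OFFSET => -3600);

-- Restore a previous state by selecting it into a new table
CREATE TABLE orders_restored AS
  SELECT * FROM orders AT (OFFSET => -3600);

-- Show the CREATE statement (argument types are part of the name)
SELECT GET_DDL('function', 'orders_for_customer(NUMBER)');
```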
- SnowSQL CLI tool
- Can source files
- Define variables
- Variable substitution (replaces &vars in statements)
- Spool into files
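A short SnowSQL session exercising those features; the file names and variable are invented:

```sql
-- Lines starting with ! are SnowSQL commands, not SQL
!set variable_substitution=true
!define target_db=analytics
-- &target_db is substituted before the statement runs
USE DATABASE &target_db;
-- Run the statements in a local file
!source setup.sql
-- Spool query output into a file, then stop
!spool results.txt
SELECT CURRENT_ACCOUNT();
!spool off
```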
- Result cache
- Tons of import file formats (CSV, JSON, Avro, ORC, Parquet, XML)
- Will automatically compress and encrypt
- Import best practices
- Dedicated small loading virtual warehouse
- Aim for roughly 10-100MB compressed per file
- Aggregate small files or split large ones to hit that range
- Separate files by folders with source and date information
- Snowpipe: continuous loading of files
- Works with cloud storage notifications to automatically load new/updated files (pipe sketch below)
- Many other third-party ETL solutions
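A minimal pipe sketch, assuming an invented landing_stage and a one-column VARIANT table; AUTO_INGEST relies on cloud storage event notifications (e.g., S3 to SQS) being configured:

```sql
CREATE OR REPLACE TABLE raw_json (payload VARIANT);

-- Continuously COPY new JSON files from the stage into the table
CREATE OR REPLACE PIPE events_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO raw_json
  FROM @landing_stage
  FILE_FORMAT = (TYPE = JSON);
```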
- Data stages
- Internal (User stage, table stage, named stages)
- External (S3, Azure Blob Storage, Google Cloud Storage)
- First, select or create a stage
- Then, PUT the file into the stage
- COPY INTO the target table
- Optionally purge the staged file (example below)
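The four steps with invented names; PUT runs from a client such as SnowSQL, and the upload is compressed and encrypted automatically:

```sql
-- 1. A named internal stage with a CSV file format
CREATE OR REPLACE FILE FORMAT csv_fmt TYPE = CSV SKIP_HEADER = 1;
CREATE OR REPLACE STAGE load_stage FILE_FORMAT = csv_fmt;

-- 2. Upload the local file to the stage (run from SnowSQL)
PUT file:///tmp/orders.csv @load_stage;

-- 3 + 4. Load the table, then purge the staged file
COPY INTO orders FROM @load_stage PURGE = TRUE;
```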
Security Model
- Access roles
- AccountAdmin
- SecurityAdmin
- SysAdmin
- UserAdmin
- Public
- Custom roles
- Role activation (assume higher privileged role temporarily when you need it)
- Warehouses, users, integrations, databases, schemas, and tables can each be secured individually (example below)
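A custom-role sketch under this model; the role, warehouse, database, schema, and user names are all invented:

```sql
CREATE ROLE analyst;
GRANT USAGE ON WAREHOUSE reporting_wh TO ROLE analyst;
GRANT USAGE ON DATABASE analytics TO ROLE analyst;
GRANT USAGE ON SCHEMA analytics.sales TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.sales TO ROLE analyst;
GRANT ROLE analyst TO USER alice;

-- Role activation: take on the privileges only when needed
USE ROLE analyst;
```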
Snowflake stores data in a compressed columnar format. Conceptually, each column is an independent file (or set of files): a query scans only the columns its filters reference, then fetches the remaining columns just for the rows that pass.