Snowflake
Snowflake is a fully managed data platform that builds on cloud vendor storage and compute in AWS, Azure, and Google Cloud. It lets you create a data warehouse or data lake and process and share data. Its core services include virtual warehouses and Snowpipe, alongside tooling: a web portal, a CLI called SnowSQL, various connectors and drivers, and integrations with BI solutions like Salesforce and Tableau. The main sections of the web portal:
- Databases
- Shares
- Data Marketplace
- Warehouses
- Worksheets
Billing
Billing has three components:
- Storage: flat rate per terabyte, with additional charges for continuous data protection (time travel and fail-safe)
- Compute: based on the number and size of virtual warehouses, plus serverless features such as Snowpipe, background table maintenance, database replication, and cloud services like auth, infrastructure, and query parsing/optimization
- Data transfer: egress charges, as is typical for cloud providers
Standard Edition
- SQL data warehouse engine
- Encryption in transit and at rest
- All system tools
- 1 day of time travel
Enterprise Edition
Standard plus:
- Materialized views
- Data masking
- Point-lookup optimization
- Multi-cluster warehouses
- Periodic rekey of encryption
- Up to 90 days of time travel
Business Critical
Enterprise plus:
- HIPAA/PCI support
- Customer-managed keys
- AWS and Azure private links within VPCs
- Disaster recovery failover and failback
Virtual Private Snowflake
Business Critical on dedicated hardware; the most secure and most isolated option.
- Snowflake architecture
- Massively parallel processing database engine
- Cluster of nodes that work together to solve a query
- Decoupled compute and storage
- Clusters of compute nodes are separate from storage, so each can scale independently, unlike traditional RDBMS solutions
- Integrated with cloud storage
- Reads from cloud object storage like AWS S3, Google Cloud Storage, and Azure Blob Storage
- How does this differ from Parquet?
- Can query files directly vs. importing data into the native Snowflake table format
- Used for ad-hoc analysis
- Can also define an External Table: a schema over a set of files; Snowflake stores the file list, which may need to be refreshed
- Direct file queries and external table queries can both be wrapped in views
- Not all functions are available in this mode (sketch below)
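A sketch of that workflow; the bucket URL, stage, table, and column names are all invented for illustration:

```sql
-- External stage over an S3 prefix holding Parquet files (hypothetical)
CREATE OR REPLACE STAGE raw_events_stage
  URL = 's3://my-bucket/events/'
  FILE_FORMAT = (TYPE = PARQUET);

-- Ad-hoc analysis: query the files directly; $1 is the row as a VARIANT
SELECT $1:event_id, $1:event_ts
FROM @raw_events_stage
LIMIT 10;

-- External table: a schema over the same files; Snowflake keeps the
-- file list, which is why a REFRESH may be needed
CREATE OR REPLACE EXTERNAL TABLE events_ext (
  event_id VARCHAR AS (VALUE:event_id::VARCHAR),
  event_ts TIMESTAMP AS (VALUE:event_ts::TIMESTAMP)
)
LOCATION = @raw_events_stage
FILE_FORMAT = (TYPE = PARQUET);

ALTER EXTERNAL TABLE events_ext REFRESH;

-- Either style can sit behind a view
CREATE OR REPLACE VIEW recent_events AS
  SELECT * FROM events_ext;
```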
- Virtual warehouse
- Main compute resources that execute queries and commands (selects, inserts, updates, deletes, merges, and copies)
- DDL does not always require a running virtual warehouse
- Shuts off automatically after a configurable period of inactivity
- Scales as needed
- Cost-efficiency of elasticity
- Supports multiple virtual warehouses so they don’t have to compete for resources on the same datasets (different parts of the company can scale as needed)
- Seems all about independence/decoupling of multiple layers of storage and compute (warehouse sketch below)
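A minimal warehouse sketch showing the elasticity knobs; the name and settings are invented:

```sql
-- Warehouse that suspends after 60s idle and resumes on the next query
CREATE OR REPLACE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

USE WAREHOUSE reporting_wh;

-- Compute scales independently of storage: resize on demand
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE';
```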
- Databases -> schemas -> tables, like PostgreSQL
- Tables
- Micro-partitions
- Automatic partitioning scheme
- 50-500MB of uncompressed data each
- Compressed columnar storage
- Pruned during query execution
- Clustering key (optional)
- Aimed at 1TB+ tables
- Orders micro-partition records based on key
- Automatically maintained
- Useful for range and equality predicates on the key (sketch below)
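A clustering-key sketch on a hypothetical fact table; SYSTEM$CLUSTERING_INFORMATION reports how well micro-partitions align with the key:

```sql
-- Large fact table clustered on the column most queries filter by
CREATE OR REPLACE TABLE orders (
  order_id NUMBER,
  order_date DATE,
  customer_id NUMBER,
  amount NUMBER(12,2)
)
CLUSTER BY (order_date);

-- Inspect clustering quality for the key
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');
```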
- Indexes? There are no traditional indexes; the closest analog is the search optimization service
- Search optimization service
- Serverless feature in Enterprise+
- Table-level property
- Applies to equality predicates
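Enabling it is a one-line table property; this reuses the hypothetical orders table from above:

```sql
-- Speeds up point lookups (equality predicates); Enterprise+ only
ALTER TABLE orders ADD SEARCH OPTIMIZATION;

-- The SEARCH_OPTIMIZATION column in the output reflects the property
SHOW TABLES LIKE 'orders';
```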
- Constraints
- Primary key
- Unique key
- Foreign key
- Not null
- Only NOT NULL is enforced; the others are kept as metadata for documentation and tooling (example below)
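A small demonstration with an invented table; the second insert succeeds because the primary key is not enforced:

```sql
CREATE OR REPLACE TABLE customers (
  customer_id NUMBER PRIMARY KEY,  -- metadata only, not enforced
  email VARCHAR UNIQUE,            -- metadata only, not enforced
  created_at TIMESTAMP NOT NULL    -- enforced
);

-- Succeeds despite the repeated "primary key" value of 1
INSERT INTO customers
SELECT 1, 'a@example.com', CURRENT_TIMESTAMP()
UNION ALL
SELECT 1, 'b@example.com', CURRENT_TIMESTAMP();
```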
- Data types
- Booleans
- Numeric (number, integer, float)
- String/Binary
- Char
- Varchar
- Date/Time
- Semi-structured
- Variant
- Array
- Object (key-value pairs)
- Geospatial types for geographical data
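A sketch of the semi-structured types in use; the table and JSON shape are invented:

```sql
CREATE OR REPLACE TABLE events (
  id NUMBER,
  payload VARIANT,  -- arbitrary JSON-like value
  tags ARRAY,
  attrs OBJECT      -- key-value pairs
);

INSERT INTO events
SELECT 1,
       PARSE_JSON('{"user": {"id": 42}, "type": "click"}'),
       ARRAY_CONSTRUCT('web', 'prod'),
       OBJECT_CONSTRUCT('region', 'us-east-1');

-- Path notation plus a cast pulls typed values out of a VARIANT
SELECT payload:user.id::NUMBER AS user_id,
       payload:type::VARCHAR  AS event_type,
       tags[0]::VARCHAR       AS first_tag
FROM events;
```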
- Views
- Normal
- Secure: hides the view definition and blocks optimizations that would expand the view in ways that could expose protected underlying data
- Materialized: Snowflake stores the results and keeps them up to date automatically (examples below)
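Sketches of all three against the hypothetical tables from earlier; materialized views are Enterprise+ and only allow a restricted query shape (single table, limited functions):

```sql
CREATE OR REPLACE VIEW order_amounts AS
  SELECT order_id, amount FROM orders;

-- Secure view: definition hidden, optimizer expansion restricted
CREATE OR REPLACE SECURE VIEW customer_emails AS
  SELECT customer_id, email FROM customers;

-- Materialized view: results stored and maintained by Snowflake
CREATE OR REPLACE MATERIALIZED VIEW daily_totals AS
  SELECT order_date, SUM(amount) AS total
  FROM orders
  GROUP BY order_date;
```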
- Code modules
- User-defined functions
- External functions: functions defined in a DB schema that call remote APIs
- Stored procedures
- Can be coded in SQL or JavaScript
- SELECT ... AT (TIMESTAMP => ...) for time travel
- INSERT from a SELECT against a past state to restore it
- Without a secure view, user-supplied predicates can be evaluated against rows outside the view's filters, leaking protected fields (hence secure views)
- Profiling shows expensive operations (nodes) in queries
- CREATE OR REPLACE FUNCTION ... RETURNS TABLE(...) AS $$ ... $$; defines a SQL table UDF
- SELECT GET_DDL('<object_type>', '<object_name>'); shows the CREATE statement for an object (fuller sketch below)
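A fuller sketch tying these together; names are hypothetical, and the restore pattern materializes the past state into a new table:

```sql
-- SQL table UDF
CREATE OR REPLACE FUNCTION orders_for_customer(cid NUMBER)
  RETURNS TABLE (order_id NUMBER, amount NUMBER(12,2))
  AS $$
    SELECT order_id, amount FROM orders WHERE customer_id = cid
  $$;

SELECT * FROM TABLE(orders_for_customer(42));

-- Time travel: read the table as it was an hour ago
SELECT * FROM orders AT (OFFSET => -3600);

-- Restore a previous state by selecting it into a new table
CREATE TABLE orders_restored AS
  SELECT * FROM orders AT (OFFSET => -3600);

-- Show the CREATE statement (argument types are part of the name)
SELECT GET_DDL('function', 'orders_for_customer(NUMBER)');
```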
- SnowSQL CLI tool
- Can source files
- Define variables
- Variable substitution (replaces &vars in statements)
- Spool into files
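A short SnowSQL session exercising those features; the file names and variable are invented:

```sql
-- Lines starting with ! are SnowSQL commands, not SQL
!set variable_substitution=true
!define target_db=analytics
-- &target_db is substituted before the statement runs
USE DATABASE &target_db;
-- Run the statements in a local file
!source setup.sql
-- Spool query output into a file, then stop
!spool results.txt
SELECT CURRENT_ACCOUNT();
!spool off
```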
- Result cache
- Tons of import file formats (CSV, JSON, Avro, ORC, Parquet, XML)
- Will automatically compress and encrypt
- Import best practices
- Dedicated small loading virtual warehouse
- Aim for roughly 10-100MB compressed per file
- Aggregate small files or split large ones to hit that range
- Separate files by folders with source and date information
- Snowpipe: continuous loading of files
- Works with cloud storage notifications to automatically load new/updated files (pipe sketch below)
- Many other third-party ETL solutions
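A minimal pipe sketch, assuming an invented landing_stage and a one-column VARIANT table; AUTO_INGEST relies on cloud storage event notifications (e.g., S3 to SQS) being configured:

```sql
CREATE OR REPLACE TABLE raw_json (payload VARIANT);

-- Continuously COPY new JSON files from the stage into the table
CREATE OR REPLACE PIPE events_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO raw_json
  FROM @landing_stage
  FILE_FORMAT = (TYPE = JSON);
```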
- Data stages
- Internal (User stage, table stage, named stages)
- External (S3, Azure Blob Storage, Google Cloud Storage)
- First, select or create a stage
- Then, PUT the file into the stage
- COPY INTO the target table
- Optionally purge the staged file (example below)
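The four steps with invented names; PUT runs from a client such as SnowSQL, and the upload is compressed and encrypted automatically:

```sql
-- 1. A named internal stage with a CSV file format
CREATE OR REPLACE FILE FORMAT csv_fmt TYPE = CSV SKIP_HEADER = 1;
CREATE OR REPLACE STAGE load_stage FILE_FORMAT = csv_fmt;

-- 2. Upload the local file to the stage (run from SnowSQL)
PUT file:///tmp/orders.csv @load_stage;

-- 3 + 4. Load the table, then purge the staged file
COPY INTO orders FROM @load_stage PURGE = TRUE;
```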
Security Model
- Access roles
- AccountAdmin
- SecurityAdmin
- SysAdmin
- UserAdmin
- Public
- Custom roles
- Role activation (assume higher privileged role temporarily when you need it)
- Warehouses, users, integrations, databases, schemas, and tables can each be secured individually (example below)
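A custom-role sketch under this model; the role, warehouse, database, schema, and user names are all invented:

```sql
CREATE ROLE analyst;
GRANT USAGE ON WAREHOUSE reporting_wh TO ROLE analyst;
GRANT USAGE ON DATABASE analytics TO ROLE analyst;
GRANT USAGE ON SCHEMA analytics.sales TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.sales TO ROLE analyst;
GRANT ROLE analyst TO USER alice;

-- Role activation: take on the privileges only when needed
USE ROLE analyst;
```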
Snowflake stores data in a compressed columnar format. Conceptually, each column is an independent file (or set of files): a query scans only the columns its filters reference, then fetches the remaining columns just for the rows that pass.