
WHITE PAPER

Best Practices for Using


Tableau with Snowflake
BY ALAN ELDRIDGE ET AL.
What’s inside:

• Introduction
• What Is Tableau?
• What Is Snowflake?
• What you don’t have to worry about with Snowflake
• Creating efficient Tableau workbooks
• Connecting to Snowflake
• Working with semi-structured data
• Working with Snowflake Time Travel
• Working with Snowflake Data Sharing
• Implementing role-based security
• Using custom aggregations
• Scaling Snowflake warehouses
• Caching
• Other performance considerations
• Measuring performance
• Find out more


Introduction
It is a widely recognized truth today that data is an increasingly valuable asset for organizations.
Greater and greater volumes of data are being generated and captured across systems at
increasing levels of granularity. At the same time, more users are demanding access to this data
to answer questions and gain business insight.

Snowflake and Tableau are leading technologies addressing these two challenges: Snowflake provides
a near-limitless platform for cloud-built data warehousing, and Tableau provides a highly intuitive,
self-service data analytics platform.

The objective of this whitepaper is to help you make best use of features from both of these
highly complementary products. It is intended for Tableau users who are new to Snowflake,
Snowflake users who are new to Tableau, and of course any users who are new to both.



What is Tableau?
Tableau Software is a business intelligence solution that integrates data analysis and reports into a
continuous visual analysis process, one that lets everyday business users quickly explore data and
shift views on the fly to follow their train of thought. Tableau combines data exploration, visualization,
reporting and dashboarding into an application that is easy to learn and use.

Tableau’s solution set consists of three main products:
• Tableau Desktop—the end-user tool for data analysis and dashboard building. It can be used on its own or in conjunction with Tableau Server/Tableau Online.
• Tableau Server—the server platform that provides services for collaboration, governance, administration and content sharing. This can be deployed in-house or in the cloud (on Amazon’s AWS, Microsoft’s Azure or Google’s GCP).
• Tableau Online—a software-as-a-service version of Tableau Server.

Either working standalone with Tableau Desktop, or by publishing content to Tableau Server or Tableau Online, users can directly work with data stored in Snowflake data warehouses.

[Figure: the Tableau platform. Interact: securely consume your data via browser, desktop, mobile or embedded into any application. Platform capabilities: collaboration, security & compliance, extensibility & APIs, analytics, content discovery, governance, data prep and data access. Deployment: deploy where and how you want, on-premises or in the cloud (hosted or multi-tenant), on Windows, Linux or Mac, integrating with your existing data infrastructure.]


What is Snowflake?
Snowflake is an analytic data warehouse provided as software-as-a-service (SaaS). Snowflake
provides a data warehouse that is faster, easier to use and far more flexible than traditional data
warehouse offerings.

Snowflake’s data warehouse is not built on an existing database or big-data software platform such as
Hadoop. The Snowflake data warehouse uses a new SQL database engine with a unique architecture
designed for the cloud. To the user, Snowflake has many similarities to other enterprise data
warehouses, but also has additional functionality and unique capabilities.

DATA WAREHOUSE AS A CLOUD SERVICE


Snowflake’s data warehouse is a true SaaS offering. More specifically:
• There is no hardware (virtual or physical) for you to select, install, configure or manage.

• There is no software for you to install, configure or manage.

• Ongoing maintenance, management and tuning are handled by Snowflake.

Snowflake runs completely on cloud infrastructure. All components of Snowflake’s service (other than
an optional command line client) run in a secure public cloud infrastructure.

Snowflake was originally built on the Amazon Web Services (AWS) cloud infrastructure. Snowflake
uses virtual compute instances provided by AWS EC2 (Elastic Compute Cloud) for its compute
needs and AWS S3 (Simple Storage Service) for persistent storage of data. In addition, as Snowflake
continues to serve enterprises of all sizes across all industries, more and more customers are
considering cloud infrastructure providers other than AWS when moving to the cloud. Snowflake has
responded and is now also available on Microsoft Azure.

Snowflake is not a packaged software offering that can be installed by a user. Snowflake manages all
aspects of software installation and updates. Snowflake cannot be run on private cloud infrastructures
(on-premises or hosted).

SNOWFLAKE ARCHITECTURE
Snowflake’s architecture is a hybrid of traditional shared-disk database architectures and shared-
nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data
repository for persisted data that is accessible from all compute nodes in the data warehouse. But
similar to shared-nothing architectures, Snowflake processes queries using MPP (massively parallel
processing) compute clusters where each node in the cluster stores a portion of the entire data set
locally. This approach offers the data management simplicity of a shared-disk architecture, but with the
performance and scale-out benefits of a shared-nothing architecture.



Snowflake’s unique architecture consists of three key layers:
• Cloud Services

• Query Processing

• Database Storage

[Figure: Snowflake architecture, running inside a cloud provider VPC. Cloud Services: authentication & access control, infrastructure manager, optimizer, metadata manager, security. Query Processing: multiple independent virtual warehouses. Database Storage.]

DATABASE STORAGE
When data is loaded into Snowflake, Snowflake reorganizes that data into its internal optimized,
compressed, columnar format. Snowflake stores this optimized data using Amazon Web Services’ S3
(Simple Storage Service) cloud storage or Azure Blob Storage.

Snowflake manages all aspects of how this data is stored in AWS or Azure—the organization, file size,
structure, compression, metadata, statistics and other aspects of data storage. The data objects stored
by Snowflake are not directly visible or accessible by customers; they are accessible only through SQL
query operations run using Snowflake.

QUERY PROCESSING
Query execution is performed in the processing layer. Snowflake processes queries using “virtual
warehouses.” Each virtual warehouse is an MPP compute cluster composed of multiple compute nodes
allocated by Snowflake from Amazon EC2 or Azure Virtual Machines.

Each virtual warehouse is an independent compute cluster that does not share compute resources
with other virtual warehouses. As a result, each virtual warehouse has no impact on the performance
of other virtual warehouses.

For more information, see virtual warehouses in the Snowflake online documentation.



CLOUD SERVICES
The cloud services layer is a collection of services that coordinate activities across Snowflake. These
services tie together all of the different components of Snowflake in order to process user requests,
from login to query dispatch. The cloud services layer also runs on compute instances provisioned
by Snowflake.

Among the services in this layer:


• Authentication

• Infrastructure management

• Metadata management

• Query parsing and optimization

• Access control



What you don’t have to worry
about with Snowflake
Because Snowflake is a cloud-built data warehouse delivered as a service, there are many things you don’t
need to worry about compared to a traditional on-premises solution:
• Installing, provisioning and maintaining hardware and software

–– Snowflake is a cloud-built data warehouse as a service. All you need to do is create an account
and load some data. You can then just connect from Tableau and start querying. We even provide
free usage to help you get started if you are new to Snowflake!

• Working out the capacity of your data warehouse

–– Snowflake is a fully elastic platform, so it can scale to handle all of your data and all of your users.
You can adjust the size of your warehouses (the layer that does the query execution) up and down
and on the fly to handle peaks and lulls in your data usage. You can even turn your warehouses
completely off to save money when you are not using them.

• Learning new tools and a new query language

–– Snowflake is a fully SQL-compliant data warehouse so all the skills and tools you already have
(such as Tableau) will easily connect. We provide connectors for ODBC, JDBC, Python, Spark and
Node.js as well as web-based and command-line interfaces.
–– Business users increasingly need to work with both traditionally structured data (e.g., data in
VARCHAR, INT and DATE columns in tables) as well as semi-structured data in formats such as
XML, JSON and Parquet. Snowflake provides a special data type called a VARIANT that allows
you to simply load your semi-structured data and then query it the same way you would query
traditionally structured data—via SQL.

• Optimizing and maintaining your data

–– Snowflake is a highly-scalable, columnar data platform that allows users to run analytic queries
quickly and easily. It does not require you to manage how your data is indexed or distributed
across partitions. In Snowflake, this is all transparently managed by the platform.
–– Snowflake also provides inherent data protection capabilities, so you don’t need to worry about
snapshots, backups or other administrative tasks such as running VACUUM jobs.



• Securing your data

–– Security is critical for all cloud-based systems, and Snowflake has you covered on this front. All
data is transparently encrypted when it is loaded into Snowflake, and it is kept encrypted at all
times when at rest and in transit.
–– If your business requirements include working with data that requires HIPAA, PII or PCI
compliance, Snowflake can also support these validations with the Enterprise for Secure
Data edition.

• Sharing data

–– Snowflake has revolutionized how organizations distribute and consume shared data. Snowflake’s
unique architecture enables live data sharing between Snowflake accounts without copying and
moving data sets. Data providers enable secure data shares to their data consumers, who can
view and seamlessly combine it with their own data sources. When a data provider adds to or
updates data, consumers always see the most current version.

All of these features free up your time and allow you to focus your attention on using Tableau to view
and understand your data.



Creating efficient
Tableau workbooks
Not every recommendation in this whitepaper is unique to using Tableau with Snowflake. There are
many best practice approaches for creating efficient workbooks that are common across all data
sources, and a great resource for this information is the “Best Practices for Designing Efficient Tableau
Workbooks” whitepaper.

That whitepaper contains too much to repeat here, but the key points are:
• Keep it simple—The majority of performance problems are caused by inefficient workbook design
(e.g., too many charts on one dashboard or trying to show too much data at once). Allow your users
to incrementally drill down to details, rather than trying to show everything at once and then filtering.
• Less is more—Work with only the data you need. The fewer rows and columns you work with, the
faster your queries will execute. Also, display only the data you need. The fewer marks you draw,
the faster your workbooks will render.
• Don’t try to do it all yourself—The query generator in Tableau is one of the most efficient on the
market, so trust it to create the queries for you. Certainly, you should understand how things in
a Tableau worksheet impact the queries generated, but the less you fight against the tool (e.g.,
through custom SQL) the better your experience will be.



Connecting to Snowflake
USE THE RIGHT CONNECTOR
To connect to Snowflake, Tableau uses the Snowflake-provided ODBC driver. When creating the
connection, use Tableau’s dedicated Snowflake connector rather than the generic ODBC connector. This
ensures that Tableau generates SQL that has been optimized for running on Snowflake.

It also provides an interface for selecting the virtual warehouse, database and schema to be used
rather than having this defined statically in the ODBC DSN.



LIVE VS. EXTRACTED CONNECTIONS
Because Snowflake is a high-performance, analytic data warehouse, it allows users to run live
connections over very large data volumes and still maintain acceptable query response times by
selecting an appropriately sized warehouse. With Snowflake there is less need to use Tableau extracts
to improve query performance than on other traditional data platforms.

Tableau extracts can be complementary to Snowflake when:


• Users require an offline data cache that can be used without a connection to Snowflake; or

• Users create aggregated extracts to act as summarized data caches. This was initially proposed by
Jeff Feng in a TC15 presentation (Cold, Warm, & Hot Data: Applying a Three-Tiered Hadoop Data
Strategy) as an approach to working with large, slow data lakes. However, because Snowflake can
handle large volumes of structured and semi-structured data while still providing analytic query
performance, you could use it to provide both the cold and warm layers.

CUSTOM SQL
Sometimes users new to Tableau try to bring old-world techniques to their workbooks such as
creating data sources using hand-written SQL statements. In many cases this is counterproductive
as Tableau can generate much more efficient queries when we simply define the join relationships
between the tables and let the query engine write SQL specific to the view being created. However,
there are times when specifying joins in the data connection window does not offer all the flexibility
you need to define the relationships in your data.

Creating a data connection using a hand-written SQL statement can be very powerful. But in some
cases, custom SQL can actually reduce performance. This is because in contrast to defining joins,
custom SQL is never deconstructed and is always executed atomically. This means no join culling
occurs, and we can end up with situations where the database is asked to process the whole query,
possibly multiple times.

Example
Consider the following, very simple Tableau worksheet. It is showing the number of records in the
TPCH SF1 sample schema for each month, broken down by the market segment:



If our underlying data model uses the recommended approach of linking tables:

The Tableau query engine produces the following optimal query that joins only the tables needed and
returns just the columns we are displaying:

SELECT "Customers"."C_MKTSEGMENT" AS "C_MKTSEGMENT",


COUNT(DISTINCT "Orders"."O_ORDERKEY") AS "ctd:O_ORDERKEY:ok",
DATE_TRUNC('MONTH',"Orders"."O_ORDERDATE") AS "tmn:O_ORDERDATE:ok"
FROM "TPCH1"."LINEITEM" "LineItem"
INNER JOIN "TPCH1"."ORDERS" "Orders" ON ("LineItem"."L_ORDERKEY" = "Orders"."O_ORDERKEY")
INNER JOIN "TPCH1"."CUSTOMER" "Customers" ON ("Orders"."O_CUSTKEY" = "Customers"."C_CUSTKEY")
GROUP BY 1,
3

This results in the following optimal query plan:



What happens though if we simulate an example of embedding business logic in the data model?
Using custom SQL, there are two approaches we can take. The preferred approach is to isolate the
custom SQL to just the part of the model that is affected and keep the rest of the schema as join
definitions. This is what the following example does simply by replacing the LINEITEM table with a
custom SQL statement:

In this case, the following query is generated by Tableau (the custom SQL is highlighted for
easy identification):

SELECT "Customers"."C_MKTSEGMENT" AS "C_MKTSEGMENT",


COUNT(DISTINCT "Orders"."O_ORDERKEY") AS "ctd:O_ORDERKEY:ok",
DATE_TRUNC('MONTH',"Orders"."O_ORDERDATE") AS "tmn:O_ORDERDATE:ok"
FROM (
SELECT *
FROM TPCH1.LINEITEM
) "LineItem"
INNER JOIN "TPCH1"."ORDERS" "Orders" ON ("LineItem"."L_ORDERKEY" = "Orders"."O_ORDERKEY")
INNER JOIN "TPCH1"."CUSTOMER" "Customers" ON ("Orders"."O_CUSTKEY" = "Customers"."C_CUSTKEY")
GROUP BY 1,
3

(Note: it is important that you don’t include a terminating semi-colon in your custom SQL as this will
cause an error.)



As you can see, the custom SQL is not decomposed but because its scope is just for the LINEITEM
table, Tableau can cull (eliminate) the joins to the unneeded tables. The Snowflake optimizer then
manages to parse this into an optimal query plan, identical to the initial example:

The worst-case result happens when we encapsulate the entire data model (with all table joins) into
a single, monolithic custom SQL statement. This is often what users who are new to Tableau will do,
especially if they have an existing query statement they have used in another tool:



Again, the custom SQL is not decomposed so all the Tableau query engine does is wrap the custom
SQL in a surrounding SELECT statement. This means there is no join culling and Snowflake is required
to join all the tables together before the required subset of data is selected (again, the custom SQL is
highlighted for readability):

SELECT "Bad Custom SQL"."C_MKTSEGMENT" AS "C_MKTSEGMENT",


COUNT(DISTINCT "Bad Custom SQL"."O_ORDERKEY") AS "ctd:O_ORDERKEY:ok",
DATE_TRUNC('MONTH',"Bad Custom SQL"."O_ORDERDATE") AS "tmn:O_ORDERDATE:ok"
FROM (
SELECT *
FROM "TPCH1"."LINEITEM" "LineItem"
INNER JOIN "TPCH1"."ORDERS" "Orders" ON ("LineItem"."L_ORDERKEY" = "Orders"."O_ORDERKEY")
INNER JOIN "TPCH1"."CUSTOMER" "Customers" ON ("Orders"."O_CUSTKEY" = "Customers"."C_CUSTKEY")
INNER JOIN "TPCH1"."PART" "Parts" ON ("LineItem"."L_PARTKEY" = "Parts"."P_PARTKEY")
INNER JOIN "TPCH1"."PARTSUPP" "PARTSUPP" ON ("Parts"."P_PARTKEY" = "PARTSUPP"."PS_PARTKEY")
INNER JOIN "TPCH1"."SUPPLIER" "Suppliers" ON ("PARTSUPP"."PS_SUPPKEY" = "Suppliers"."S_SUPPKEY")
INNER JOIN "TPCH1"."NATION" "CustNation" ON ("Customers"."C_NATIONKEY" = "CustNation"."N_
NATIONKEY")
INNER JOIN "TPCH1"."REGION" "CustRegion" ON ("CustNation"."N_REGIONKEY" = "CustRegion"."R_
REGIONKEY")
) "Bad Custom SQL"
GROUP BY 1,
3

As the following query plan shows, the Snowflake optimizer can parse this only so far, and a less
efficient query is performed:

This last approach is even more inefficient as this entire query plan needs to be run for potentially every
query in the dashboard. In this example, there is only one. But if there were multiple data driven zones
(e.g., multiple charts, filters, legends, etc.) then each could potentially run this entire query wrapped in
its own select clause. If this were the case, it would be preferable to follow the Tableau recommended
practices and create the data connection as a Tableau extract (data volumes permitting). This would require
the complex query to run only once and then all subsequent queries would run against the extract.



INITIAL SQL
As just discussed, one of the limitations of custom SQL is that it can be run multiple times for a single
dashboard. One way to avoid this is to use initial SQL to create a temporary table which will then be
the selected table in your query. Because initial SQL is executed only once when the workbook is
opened (as opposed to every time the visualization is changed for custom SQL), this could significantly
improve performance in some cases. However, the flip-side is that the data populated into the temp
table will be static for the duration of the session, even if the data in the underlying tables changes.

Example
Using the example above, instead of placing the entire query in a custom SQL statement, we could use
it in an initial SQL block and instantiate a temporary table:

CREATE OR REPLACE TEMP TABLE TPCH1.FOO AS
SELECT *
FROM "TPCH1"."LINEITEM" "LineItem"
INNER JOIN "TPCH1"."ORDERS" "Orders" ON ("LineItem"."L_ORDERKEY" = "Orders"."O_ORDERKEY")
INNER JOIN "TPCH1"."CUSTOMER" "Customers" ON ("Orders"."O_CUSTKEY" = "Customers"."C_CUSTKEY")
INNER JOIN "TPCH1"."PART" "Parts" ON ("LineItem"."L_PARTKEY" = "Parts"."P_PARTKEY")
INNER JOIN "TPCH1"."PARTSUPP" "PARTSUPP" ON ("Parts"."P_PARTKEY" = "PARTSUPP"."PS_PARTKEY")
INNER JOIN "TPCH1"."SUPPLIER" "Suppliers" ON ("PARTSUPP"."PS_SUPPKEY" = "Suppliers"."S_SUPPKEY")
INNER JOIN "TPCH1"."NATION" "CustNation" ON ("Customers"."C_NATIONKEY" = "CustNation"."N_NATIONKEY")
INNER JOIN "TPCH1"."REGION" "CustRegion" ON ("CustNation"."N_REGIONKEY" = "CustRegion"."R_REGIONKEY");

We then simply select the FOO table as our data source:

And Tableau generates the following query:

SELECT "FOO"."C_MKTSEGMENT" AS "C_MKTSEGMENT",


COUNT(DISTINCT "FOO"."O_ORDERKEY") AS "ctd:O_ORDERKEY:ok",
DATE_TRUNC('MONTH',"FOO"."O_ORDERDATE") AS "tmn:O_ORDERDATE:ok"
FROM "TPCH1"."FOO" "FOO"
GROUP BY 1,
3



This has a very simple query plan and fast execution time as the hard work has been done in the
creation of the temp table. However, the data returned will not reflect changes to the underlying fact
tables until we start a new session and the temp table is recreated.

Note that on Tableau Server an administrator can set a restriction on initial SQL so that it will not
run. You might need to check to see that this is okay in your environment if you are planning to
publish your workbook to share with others. Also, note that temp tables take up additional space
in Snowflake that will contribute to the account storage charges, but they are ephemeral, so this is
generally not significant.

VIEWS
Views are another alternative to using custom SQL. Instead of materializing the result into a temp
table, which is potentially a time-consuming action to perform when the workbook is initially opened,
you could define a view:

CREATE OR REPLACE VIEW FOO_VW AS
SELECT *
FROM "TPCH1"."LINEITEM" "LineItem"
INNER JOIN "TPCH1"."ORDERS" "Orders" ON ("LineItem"."L_ORDERKEY" = "Orders"."O_ORDERKEY")
INNER JOIN "TPCH1"."CUSTOMER" "Customers" ON ("Orders"."O_CUSTKEY" = "Customers"."C_CUSTKEY")
INNER JOIN "TPCH1"."PART" "Parts" ON ("LineItem"."L_PARTKEY" = "Parts"."P_PARTKEY")
INNER JOIN "TPCH1"."PARTSUPP" "PARTSUPP" ON ("Parts"."P_PARTKEY" = "PARTSUPP"."PS_PARTKEY")
INNER JOIN "TPCH1"."SUPPLIER" "Suppliers" ON ("PARTSUPP"."PS_SUPPKEY" = "Suppliers"."S_SUPPKEY")
INNER JOIN "TPCH1"."NATION" "CustNation" ON ("Customers"."C_NATIONKEY" = "CustNation"."N_NATIONKEY")
INNER JOIN "TPCH1"."REGION" "CustRegion" ON ("CustNation"."N_REGIONKEY" = "CustRegion"."R_REGIONKEY")

In Tableau, we then use the view as our data source:



The query generated by Tableau is simple, as it is with the temp table approach:

SELECT "FOO_VW"."C_MKTSEGMENT" AS "C_MKTSEGMENT",


COUNT(DISTINCT "FOO_VW"."O_ORDERKEY") AS "ctd:O_ORDERKEY:ok",
DATE_TRUNC('MONTH',"FOO_VW"."O_ORDERDATE") AS "tmn:O_ORDERDATE:ok"
FROM "TPCH1"."FOO_VW" "FOO_VW"
GROUP BY 1,
3

However, this takes longer to run and has a less efficient query plan as the view needs to be evaluated
at query time. The benefit is that you will always see up-to-date data in your results:

All of the above approaches are valid but you should select the most appropriate solution based on
your particular needs of performance, data freshness, maintainability and reusability.



Working with
semi-structured data
Today, business users find themselves working with data in multiple forms from numerous sources.
This includes an ever-expanding amount of machine-generated data from applications, sensors, mobile
devices, etc. Increasingly, these data are provided in semi-structured data formats such as JSON,
Avro, ORC, Parquet and XML that have flexible schemas. These semi-structured data formats do not
conform to the standards of traditionally structured data, but instead contain tags or other types of
mark-up that identify individual, distinct elements within the data:

{ "city": {
"coord": { "lat": -37.813999, "lon": 144.963318 },
"country": "AU",
"findname": "MELBOURNE",
"id": 2158177,
"name": "Melbourne",
"zoom": 5 },
"clouds": {
"all": 88 },
"main": {
"humidity": 31, "pressure": 1010, "temp": 303.4,
"temp_max": 305.15, "temp_min": 301.15 },
"rain": {
"3h": 0.57 },
"time": 1514426634,
"weather": [ {
"description": "light rain", "icon": "10d",
"id": 500, "main": "Rain" } ],
"wind": {
"deg": 350, "speed": 4.1 }
}

Two of the key attributes that distinguish semi-structured data from structured data are nested data
structures and the lack of a fixed schema:
• Unlike structured data, which represents data as a flat table, semi-structured data can contain
n-level hierarchies of nested information.
• Structured data requires a fixed schema that is defined before the data can be loaded and queried
in a relational database system. Semi-structured data does not require a prior definition of a schema
and can constantly evolve; i.e., new attributes can be added at any time.

Tableau has introduced support for directly reading JSON data files. However, broader support for
semi-structured data can be achieved by first loading the data into Snowflake. Snowflake provides
native support for semi-structured data, including:
• Flexible-schema data types for loading semi-structured data without transformation

• Direct ingestion of JSON, Avro, ORC, Parquet and XML file formats

• Automatic conversion of data to Snowflake’s optimized internal storage format

• Database optimization for fast and efficient querying of semi-structured data



THE VARIANT DATA TYPE
Rather than requiring semi-structured data to be parsed and transformed into a traditional schema
of single-value columns, Snowflake stores semi-structured data in a single column of a special type—
VARIANT. Each VARIANT column can contain an entire semi-structured object consisting of multiple
key-value pairs.

SELECT * FROM SNOWFLAKE_SAMPLE_DATA.WEATHER.WEATHER_14_TOTAL
LIMIT 2;

V::VARIANT | T::TIMESTAMP

{ "city": { "coord": { "lat": 27.716667, "lon": 85.316666 }, "country": "NP", "findname": "KATHMANDU", "id": 1283240, "name": "Kathmandu", "zoom": 7 }, "clouds": { "all": 75 }, "main": { "humidity": 65, "pressure": 1009, "temp": 300.15, "temp_max": 300.15, "temp_min": 300.15 }, "time": 1504263774, "weather": [ { "description": "broken clouds", "icon": "04d", "id": 803, "main": "Clouds" } ], "wind": { "deg": 290, "speed": 2.6 } } | 1.Sep.2017 04:02:54

{ "city": { "coord": { "lat": 8.598333, "lon": -71.144997 }, "country": "VE", "findname": "MERIDA", "id": 3632308, "name": "Merida", "zoom": 8 }, "clouds": { "all": 12 }, "main": { "grnd_level": 819.46, "humidity": 90, "pressure": 819.46, "sea_level": 1027.57, "temp": 287.006, "temp_max": 287.006, "temp_min": 287.006 }, "time": 1504263774, "weather": [ { "description": "few clouds", "icon": "02d", "id": 801, "main": "Clouds" } ], "wind": { "deg": 122.002, "speed": 0.75 } } | 1.Sep.2017 04:02:54

The VARIANT type is quite clever—it doesn’t just store the semi-structured object as a text string, but
rather the individual keys and their values are stored in a columnarized format, just like normal columns in
a relational table. This means that storage and query performance for operations on data in a VARIANT
column are very similar to storage and query performance for data in a normal relational column1.

Note that the maximum number of key-value pairs that will be columnarized for a single VARIANT
column is 1000. If your semi-structured data has > 1000 key value pairs you may benefit from
spreading the data across multiple VARIANT columns. Additionally, each VARIANT entry is limited to a
maximum size of 16MB of compressed data.

The individual key-value pairs can be queried directly from VARIANT columns with a minor extension
to traditional SQL syntax:

SELECT V:time::TIMESTAMP TIME,
V:city.name::VARCHAR CITY,
V:city.country::VARCHAR COUNTRY,
(V:main.temp_max - 273.15)::FLOAT AS TEMP_MAX,
(V:main.temp_min - 273.15)::FLOAT AS TEMP_MIN,
V:weather[0].main::VARCHAR AS WEATHER_MAIN
FROM SNOWFLAKE_SAMPLE_DATA.WEATHER.WEATHER_14_TOTAL;

1 For non-array data that uses only native JSON types (strings and numbers, not timestamps). Non-native values such as dates and timestamps
are stored as strings when loaded into a VARIANT column, so operations on these values could be slower and consume more space than
when stored in a relational column with the corresponding data type.

TIME                  CITY        COUNTRY   TEMP_MAX   TEMP_MIN   WEATHER_MAIN

8-Jan-2018 1:05 am    Melbourne   AU        20         19         Rain
8-Jan-2018 12:02 am   Melbourne   AU        19         19         Rain
7-Jan-2018 11:05 pm   Melbourne   AU        19         18         Clouds

Detailed documentation on dealing with semi-structured data can be found in the Snowflake online documentation here:
https://docs.snowflake.net/manuals/user-guide/semistructured-concepts.html



ACCESSING SEMI-STRUCTURED DATA FROM TABLEAU
Unfortunately, Tableau does not automatically recognize the VARIANT data type so it doesn’t
automatically create queries with the SQL extensions outlined above. This means we need to manually
create the SQL necessary for accessing the data in these columns.

As outlined earlier in this paper, one way to access semi-structured data is to use custom SQL.
The best practices identified there should be followed in this case—specifically that you don’t use a
monolithic statement that joins across multiple tables. Instead, use a discrete statement to reference
the key-value pairs from the VARIANT column and then join that custom SQL “table” with the other
regular tables. Also remember to set “assume referential integrity,” so the Tableau query generator can
cull tables when they are not required.

Example
To use the WEATHER_14_TOTAL table in the sample WEATHER schema, we can create a custom SQL
“table” in Tableau using the following query:
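
A sketch of such a statement (the column aliases are illustrative; it reuses the key-value paths shown earlier and omits a terminating semi-colon, per the earlier custom SQL guidance):

SELECT V:time::TIMESTAMP AS TIME,
V:city.name::VARCHAR AS CITY,
V:city.country::VARCHAR AS COUNTRY,
(V:main.temp_max - 273.15)::FLOAT AS TEMP_MAX,
(V:main.temp_min - 273.15)::FLOAT AS TEMP_MIN,
V:weather[0].main::VARCHAR AS WEATHER_MAIN
FROM SNOWFLAKE_SAMPLE_DATA.WEATHER.WEATHER_14_TOTAL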



This data source can then be used to create vizzes and dashboards as if the user were connected to a
traditional structured schema, with similar performance:

In this example, the WEATHER_14_TOTAL table has ~66M records and response times in Tableau
using an X-Small warehouse were quite acceptable.

Of course, as outlined earlier, this SQL could also be used in a DB view, or in an initial SQL statement
to create a temp table—the specific approach you use should be dictated by your needs. Alternatively,
as a way to ensure governance and a consistent view of the JSON data, you can always publish the
data source to Tableau Server (or Tableau Online) for reuse across multiple users and workbooks:



PASS-THROUGH SQL
Tableau provides a number of functions that allow us to pass raw SQL fragments through to
the underlying data source for processing. These are often referred to as RAWSQL functions:

http://onlinehelp.tableau.com/current/pro/desktop/en-us/functions_functions_passthrough.html

We can use these to dynamically extract values from a VARIANT field:
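
For example, calculated fields along these lines (a sketch; the field name [V] and the exact RAWSQL variants are illustrative) extract the country code and a Celsius temperature from the VARIANT column:

// Country code (string result)
RAWSQL_STR("%1:city.country::string", [V])

// Maximum temperature in Celsius (numeric result)
RAWSQL_REAL("(%1:main.temp_max - 273.15)::float", [V])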

These SQL fragments are passed through to Snowflake without alteration. Here is a query from a
calculation based on RAWSQL statements (the fragments are highlighted):

SELECT (V:city.country::string) AS "ID",
AVG(((V:main.temp_max - 273.15)::float)) AS "avg:temp_max:ok"
FROM "WEATHER"."WEATHER_14_TOTAL" "WEATHER_14_TOTAL" GROUP BY 1

The advantage of this approach is that the Tableau author can extract specific VARIANT values
without needing to create an entire custom SQL table statement.

ELT
While there are benefits to keeping your semi-structured data in VARIANT data types, there are
also limitations as outlined above. If your data schema is well defined and static, you may benefit
from performing some ELT (extract-load-transform) transformations on your data to convert it into a
traditional data schema.

There are multiple ways to do this. One is to use SQL statements directly in Snowflake (for more
detail on the supported semi-structured data functions, see
https://docs.snowflake.net/manuals/sql-reference/functions-semistructured.html#semi-structured-data-functions).
If you need to conditionally separate the data into multiple tables, you can use Snowflake’s multi-table
insert statement to improve performance:
https://docs.snowflake.net/manuals/sql-reference/sql/insert-multi-table.html.



Another approach is to use third-party ETL/ELT tools from Snowflake partners such as
Informatica, Matillion, Alooma and Alteryx (to name a few). Note that if you are dealing with
large volumes of data, you will want to use tools that can take advantage of in-DB processing.
This means that the data processing will be pushed down into Snowflake. This greatly increases
workflow performance by eliminating the need to transfer massive amounts of data out of
Snowflake to the ETL tool, manipulate it locally and then push it back into Snowflake.
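
As a sketch of the SQL-based approach described above (the target table name and column choices are illustrative, mirroring the earlier weather examples), a CREATE TABLE ... AS SELECT can materialize commonly used VARIANT paths into a conventional relational table:

-- Flatten frequently queried keys from the VARIANT column into typed relational columns
CREATE OR REPLACE TABLE WEATHER_FLAT AS
SELECT V:time::TIMESTAMP AS OBSERVATION_TIME,
V:city.name::VARCHAR AS CITY,
V:city.country::VARCHAR AS COUNTRY,
(V:main.temp_max - 273.15)::FLOAT AS TEMP_MAX_C,
(V:main.temp_min - 273.15)::FLOAT AS TEMP_MIN_C,
V:weather[0].main::VARCHAR AS WEATHER_MAIN
FROM SNOWFLAKE_SAMPLE_DATA.WEATHER.WEATHER_14_TOTAL;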



Working with
Snowflake Time Travel
Snowflake Time Travel is a powerful feature that lets users access historical data. This enables many
powerful capabilities but in the case of Tableau it allows users to query data in the past that has since
been updated or deleted.

When any data manipulation operations are performed on a table (e.g., INSERT, UPDATE, DELETE,
etc.), Snowflake retains previous versions of the table data for a defined period of time. This enables
querying earlier versions of the data using an AT or BEFORE clause.

Example
The following query selects historical data from a table as of the date and time represented by the
specified timestamp:

SELECT *
FROM my_table AT(timestamp => 'Mon, 01 May 2015 16:20:00 -0700'::timestamp);

The following query selects historical data from a table as of five minutes ago:

SELECT *
FROM my_table AT(offset => -60*5);

The following query selects historical data from a table up to, but not including any changes made by
the specified statement:

SELECT *
FROM my_table BEFORE(statement => '8e5d0ca9-005e-44e6-b858-a8f5b37c5726');

These features are included standard for all accounts (i.e., no additional licensing is required). However,
the standard Time Travel retention period is one day. Extended Time Travel (up to 90 days) requires
Snowflake Enterprise Edition (or higher). In addition, Time Travel requires additional data storage, which
has associated fees.

ACCESSING TIME TRAVEL DATA FROM TABLEAU


As with semi-structured data, Tableau does not automatically understand how to create time travel
queries in Snowflake, so we must use custom SQL. Again, we recommend you use best practices by
limiting the scope of your custom SQL to the table(s) affected.

Example
The following Tableau data source shows a copy of the TPCH_SF1 schema, with Time Travel enabled.
We want to be able to query the LineItem table and use the data it contained at an arbitrary point in
the past. To do this we use a custom SQL statement for just the LineItem table. In this case, we set an
AT clause for the LineItem table and use a parameter to pass in the Time Travel time value:
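
A sketch of that custom SQL, with a Tableau parameter (shown here with the illustrative name <Parameters.As At Time>) inserted into the AT clause:

SELECT *
FROM TPCH1.LINEITEM AT(timestamp => <Parameters.As At Time>::TIMESTAMP_NTZ)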



Using this query in the following worksheet:

Results in the following query running in Snowflake (the custom SQL is highlighted):

SELECT "Customers"."C_MKTSEGMENT" AS "C_MKTSEGMENT",


COUNT(DISTINCT "Orders"."O_ORDERKEY") AS "ctd:O_ORDERKEY:ok",
DATE_TRUNC('MONTH',"Orders"."O_ORDERDATE") AS "tmn:O_ORDERDATE:ok"
FROM (
SELECT *
FROM TPCH1.LINEITEM AT(timestamp => '2018-01-01 12:00:00.967'::TIMESTAMP_NTZ)
) "LineItem"
INNER JOIN "TPCH1"."ORDERS" "Orders" ON ("LineItem"."L_ORDERKEY" = "Orders"."O_ORDERKEY")
INNER JOIN "TPCH1"."CUSTOMER" "Customers" ON ("Orders"."O_CUSTKEY" = "Customers"."C_CUSTKEY")
GROUP BY 1,
3

Thanks to the parameter, it is easy for the user to select different as-at times. Again, this SQL could
also be used in a DB view or in an initial SQL statement to create a temp table. The specific approach
you use should be dictated by your needs.



Working with
Snowflake Data Sharing
Snowflake Data Sharing makes it possible to directly share data in real-time and in a secure, governed
and scalable way from the Snowflake cloud data warehouse. Organizations can use it to share data
internally across lines of business, or even externally with customers and partners with almost
no friction or effort. This is an entirely new way of connecting users to data that doesn’t require
transmission, and it significantly reduces the traditional pain points of storage duplication and latency.
Instead of transmitting data, Snowflake Data Sharing enables “data consumers” to directly access read-
only copies of live data that’s in a “data provider’s” account.

The obvious advantage of this for Tableau users is the ability to query data shared by a provider and to
know it is always up-to-date. No ongoing data administration is required by the consumer.

Example
To access shared data, you first view the available inbound shares:

show shares;

You can then create a database from the inbound share and apply appropriate privileges to the
necessary roles:

//Create a database from the share
create or replace database SNOWFLAKE_SAMPLE_DATA
from share SNOWFLAKE.SHARED_SAMPLES;

//Grant permissions to others
grant imported privileges on database SNOWFLAKE_SAMPLE_DATA to role public;

Users can then access the database objects as if they were local. The tables and views appear to
Tableau the same as any other:



Implementing
role-based security
A common requirement when accessing data is to implement role-based security where the data
returned in a Tableau viz is restricted either by row, column or both. There are multiple ways to
achieve this in Tableau and Snowflake, and a key determinant of the right solution is how reusable you
want it to be.

The following sections explain how to implement both Tableau-only and generic solutions.

SETTING UP THE DATA ACCESS RULES


To restrict the data available to a user, we need a table that defines, for each user context, the
data elements that user is allowed to access.

Example
Using the TPCH_SF1 schema in the sample data, we create a simple REGION_SECURITY table as follows:
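
A minimal sketch of the DDL to create and populate such a table (the data types are illustrative):

CREATE OR REPLACE TABLE TPCH1.REGION_SECURITY (
RS_REGIONKEY NUMBER,
RS_USER VARCHAR
);

INSERT INTO TPCH1.REGION_SECURITY (RS_REGIONKEY, RS_USER) VALUES
(1, 'alan'), (2, 'alan'), (3, 'alan'), (4, 'alan'), (5, 'alan'),
(1, 'clive'), (2, 'clive'),
(3, 'kelly'), (4, 'kelly');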

RS_REGIONKEY   RS_USER

1              alan
2              alan
3              alan
4              alan
5              alan
1              clive
2              clive
3              kelly
4              kelly

We link this table via the REGION and NATION tables to the CUSTOMER table:



In Tableau we now see different result sets returned for each of the three RS_USER values:

If we needed to create access rules across multiple dimensions of our data (e.g., in addition to region,
we also want to restrict access by product brand), then we would create multiple access rule tables
and connect them appropriately into our data schema.

PASSING IN THE USER CONTEXT


To make the above model secure, we need to enforce the restriction so that a viewing user can only
see the result set permitted to their user context, i.e., where the RS_USER field is equal to the viewing
user’s username. We need to pass this username from Tableau into Snowflake, and the way we do
this depends on whether we are creating a Tableau-only solution or something that is generic and will
work for any data tool.

TABLEAU-ONLY SOLUTION
If we are building a Tableau-only solution we have the ability to define some of our security logic at
the Tableau layer. This can be an advantage in some scenarios, particularly when the viz author does
not have permission to modify the database schema or add new objects (e.g., views), as would be the
case if you were consuming a data share from another account.

Additionally, if the interactor users in Tableau (i.e., users accessing the viz via Tableau Server, as
opposed to the author creating it in Tableau Desktop) do not use individual logins for Snowflake, then
you must create the solution using this approach because the Snowflake variable CURRENT_USER
will be the same for all user sessions.

Example
To enforce the restriction, we create a data source filter in Tableau that restricts the result set to
where the RS_USER field matches the Tableau function USERNAME() which returns the login of the
current user:
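
One plausible formulation of that data source filter is a boolean calculation, kept to True, that compares the security table’s user column with USERNAME() (the field naming is illustrative):

UPPER([RS_USER]) = UPPER(USERNAME())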



The result is only the data permitted for the viewing user. Note, that in the following screenshot there
are no worksheet-level filters as the filter is enforced on the data source:

The USERNAME() function is evaluated in Tableau and then pushed through to Snowflake as a literal,
as you can see in the resulting query:

SELECT 'ALAN' AS "Calculation_3881117741409583110",
"CustNation"."N_NAME" AS "N_NAME"
FROM "TPCH1"."NATION" "CustNation"
INNER JOIN "TPCH1"."REGION" "CustRegion" ON ("CustNation"."N_REGIONKEY" = "CustRegion"."R_REGIONKEY")
INNER JOIN "TPCH1"."REGION_SECURITY" "REGION_SECURITY" ON ("CustRegion"."R_REGIONKEY" = "REGION_SECURITY"."RS_REGIONKEY")
WHERE (UPPER("REGION_SECURITY"."RS_USER") = 'ALAN')
GROUP BY 2



To enforce this filter and prevent a workbook author from editing or removing it, you should publish
the data source to Tableau Server and create the workbook using the published data source.

SOLUTION FOR ANY DATA TOOL


While the above solution is simple to implement it only works for environments where all access is
via Tableau, as the security logic is implemented in the Tableau layer, not the database. To make the
solution truly robust and applicable for users connecting via any tool, we need to push the logic down
into Snowflake. How we do this depends on whether each user has their own Snowflake login or
whether all queries are sent to Snowflake via a common login.

Example
Continuing the example above, rather than using a data source filter in Tableau, we will create a view
in Snowflake:

create or replace secure view "TPCH1"."SECURE_REGION_VW" as
select R_REGIONKEY, R_NAME, R_COMMENT, RS_USER
from "TPCH1"."REGION" "CustRegion"
inner join "TPCH1"."REGION_SECURITY" "REGION_SECURITY"
on ("CustRegion"."R_REGIONKEY" = "REGION_SECURITY"."RS_REGIONKEY")
where (UPPER("REGION_SECURITY"."RS_USER") = UPPER(CURRENT_USER));

In this example we are using the Snowflake variable CURRENT_USER but we could equally be using
CURRENT_ROLE if your solution needed to be more scalable. Clearly this requires each viewing user
to be logging in to Snowflake with their own credentials—you can enforce this when publishing the
workbook to Tableau Server by setting the authentication type to “prompt user”:

If you are using “embedded password” then CURRENT_USER and CURRENT_ROLE will be the same
for all user sessions. In this case we need to pass in the viewing user name via an initial SQL block:
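
A minimal sketch of such an initial SQL block, using Tableau’s built-in [TableauServerUser] parameter to set a Snowflake session variable (the variable name matches the view below):

SET VIEWING_USER = [TableauServerUser];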



The view would then reference this variable:

create or replace secure view "TPCH1"."SECURE_REGION_VW" as
select R_REGIONKEY, R_NAME, R_COMMENT, RS_USER
from "TPCH1"."REGION" "CustRegion"
inner join "TPCH1"."REGION_SECURITY" "REGION_SECURITY"
on ("CustRegion"."R_REGIONKEY" = "REGION_SECURITY"."RS_REGIONKEY")
where (UPPER("REGION_SECURITY"."RS_USER") = UPPER($VIEWING_USER));

This then enforces the filter on the result set returned to Tableau based on the Tableau user name:

There is one final step required to enforce the security rules specified in the SECURE_REGION_VW
view. We need to enforce referential integrity in the schema. Without this, if you don’t use the
SECURE_REGION_VW in your query, then join culling could drop this table from the query and
security would be bypassed.



If your data permits, you can create constraints between the tables in Snowflake (see
https://docs.snowflake.net/manuals/sql-reference/sql/create-table-constraint.html for details) or you can
simply uncheck “assume referential integrity” in Tableau:



Using custom aggregations
Snowflake provides a number of custom aggregation functions outside the ANSI SQL specification. Some
of these are specific to working with semi-structured data, while others are useful for approximating
results (e.g., cardinality, similarity, frequency, etc.) when you are working over very large data volumes.
A full list of the available aggregation functions can be found in the Snowflake online documentation:

https://docs.snowflake.net/manuals/sql-reference/functions-aggregation.html

To use these functions in Tableau, you can leverage the pass-through functions in Tableau:

http://onlinehelp.tableau.com/current/pro/desktop/en-us/functions_functions_passthrough.html

Example
Snowflake provides a custom aggregation function APPROX_COUNT_DISTINCT which uses
HyperLogLog to return an approximation of the distinct cardinality of a field. To use this function,
we can create a calculated field that leverages the appropriate RAWSQL function. When we use this
calculation in a Tableau viz:
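
A sketch of such a calculated field (the RAWSQL variant and field name are illustrative):

RAWSQLAGG_INT("APPROX_COUNT_DISTINCT(%1)", [O_ORDERKEY])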

Tableau produces the following query in Snowflake:

SELECT COUNT(DISTINCT "Orders"."O_ORDERKEY") AS "ctd:O_ORDERKEY:ok",
(APPROX_COUNT_DISTINCT("Orders"."O_ORDERKEY")) AS "usr:Calculation_1944218056180731904:ok",
DATE_PART('YEAR',"Orders"."O_ORDERDATE") AS "yr:O_ORDERDATE:ok"
FROM "TPCH_SF1"."LINEITEM" "LineItem"
INNER JOIN "TPCH_SF1"."ORDERS" "Orders" ON ("LineItem"."L_ORDERKEY" = "Orders"."O_ORDERKEY")
GROUP BY 3



Scaling Snowflake warehouses
Snowflake supports two ways to scale warehouses:
• Scale up by resizing a warehouse.

• Scale out by adding clusters to a warehouse (requires Snowflake Enterprise Edition or higher).

WAREHOUSE RESIZING IMPROVES PERFORMANCE


Resizing a warehouse generally improves query performance, particularly for larger, more complex
queries. It can also help reduce the queuing that occurs if a warehouse does not have enough servers
to process all the queries that are submitted concurrently. Note that warehouse resizing is not
intended for handling concurrency issues. Instead, use additional warehouses to handle the workload
or use a multi-cluster warehouse if this feature is available for your account.

The number of servers required to process a query depends on the size and complexity of the query.
For the most part, queries scale linearly with warehouse size, particularly for larger, more complex queries:
• When considering factors that impact query processing, the overall data size of the tables being
queried has more impact than the number of rows.
• Filtering in a query using predicates also has an impact on processing, as does the number of joins/
tables in the query.

Snowflake supports resizing a warehouse at any time, even while running. If a query is running slowly
and you have additional queries of similar size and complexity that you want to run on the same
warehouse, you might choose to resize the warehouse while it is running. However, note the following:
• As stated earlier about warehouse size, larger is not necessarily faster; for smaller, basic queries that
are already executing quickly, you may not see any significant improvement after resizing.
• Resizing a running warehouse does not impact queries that are already being processed by the
warehouse; the additional servers are only used for queued and new queries.

Decreasing the size of a running warehouse removes servers from the warehouse. When the servers
are removed, the cache associated with the servers is dropped, which can impact performance the
same way that suspending the warehouse can impact performance after it is resumed. Keep this in
mind when choosing whether to decrease the size of a running warehouse or keep it at the current
size. In other words, there is a trade-off with saving credits versus maintaining the server cache.
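
Resizing, suspending and resuming a warehouse are simple SQL commands; as a sketch (the warehouse name is illustrative):

-- Scale up ahead of a heavy workload
ALTER WAREHOUSE TABLEAU_WH SET WAREHOUSE_SIZE = 'LARGE';

-- Scale back down, or suspend entirely, once the peak has passed
ALTER WAREHOUSE TABLEAU_WH SET WAREHOUSE_SIZE = 'XSMALL';
ALTER WAREHOUSE TABLEAU_WH SUSPEND;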

Example
To demonstrate the scale-up capability of Snowflake, the following dashboard was created against
the TPCH-SF10 sample data schema and published to Tableau Server. Note that this dashboard was
intentionally designed to ensure multiple queries would be initiated in the underlying DBMS:



The data source for the workbook was also configured with an initial SQL statement to disable query
result caching within Snowflake via the USE_CACHED_RESULT session parameter. Result caching
(which we cover in more detail later) would make every query after the first one extremely fast
because the query results do not need to be recomputed and are just read from cache in < 1 sec.
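
Disabling result caching for a session is done with a statement along these lines (a sketch of the kind of initial SQL the test describes):

ALTER SESSION SET USE_CACHED_RESULT = FALSE;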

TabJolt was then installed on a Windows PC to act as a load generator, invoking the dashboard with a
:refresh=y parameter to prevent caching within Tableau Server. A series of test runs were then performed,
running the above dashboard against different Snowflake warehouse sizes from XS to XL. All tests
were run for five minutes with a single simulated user using the InteractVizLoadTest script, and the total
number of completed tests and average test response times were recorded. The results are as follows:



As you can see, as we increase the size of the warehouse (from XS up to XL) the average query time
decreases because there are more compute resources available. The improvement tails off from L to
XL as we have reached a point where compute resources are no longer a bottleneck.

You can also see in the first run of the query at each warehouse size (the orange dot in the chart
below), data needs to be read from the underlying table storage into the warehouse cache.
Subsequent runs for each warehouse size can then read data entirely from the warehouse cache,
resulting in improved performance:

To achieve the best results, try to execute relatively homogeneous queries (size, complexity, data sets,
etc.) on the same warehouse; executing queries of widely-varying size and/or complexity on the same
warehouse makes it more difficult to analyze warehouse load, which can make it more difficult to
select the best size to match the size, composition and number of queries in your workload.

MULTI-CLUSTER WAREHOUSES IMPROVE CONCURRENCY


Multi-cluster warehouses enable you to scale warehouse resources to manage your user and query
concurrency needs as they change, such as during peak and off hours.

By default, a warehouse consists of a single cluster of servers that determines the total resources
available to the warehouse for executing queries. As queries are submitted to a warehouse, the
warehouse allocates server resources to each query and begins executing the queries. If resources are
insufficient to execute all the queries submitted to the warehouse, Snowflake queues the additional
queries until the necessary resources become available.

With multi-cluster warehouses, Snowflake supports allocating, either statically or dynamically, a larger
pool of resources to each warehouse.



When deciding whether to use multi-cluster warehouses and choosing the number of clusters to use
per warehouse, consider the following:
• Unless you have a specific requirement for running in maximized mode, configure multi-cluster
warehouses for auto-scaling. This enables Snowflake to automatically start and stop clusters as
needed. Snowflake starts new clusters when it detects queries are queuing, and it stops clusters
when it determines that N-1 clusters are sufficient to run the current query load.
• There are two scaling policies available:

–– Standard scaling minimizes queuing by running new clusters as soon as queries start to queue.
The clusters will shut down after 2-3 successful load redistribution checks, which are performed
at one-minute intervals.
–– Economy scaling conserves credits by keeping running clusters fully loaded rather than starting
additional clusters. Economy will start a new cluster if the system estimates there is enough
query load to keep it busy for at least six minutes, and it will shut clusters down after 5-6
successful load redistribution checks.
• Minimum: Keep the default value of 1. This ensures additional clusters start only as needed.
However, if high-availability of the warehouse is a concern, set the value higher than 1. This helps
ensure warehouse availability and continuity in the unlikely event that a cluster fails.
• Maximum: Set this value as large as possible, while being mindful of the warehouse size and
corresponding credit costs. For example, an XL warehouse (16 servers) with maximum clusters = 10
will consume 160 credits in an hour if all 10 clusters run continuously for the hour.

Multi-cluster warehouses are a feature of Snowflake Enterprise edition (or higher). More information
on scaling multi-cluster warehouses is available here.
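
As a sketch, an auto-scaling multi-cluster warehouse reflecting the recommendations above could be created as follows (the name and limits are illustrative):

CREATE OR REPLACE WAREHOUSE TABLEAU_MC_WH WITH
WAREHOUSE_SIZE = 'MEDIUM'
MIN_CLUSTER_COUNT = 1 -- additional clusters start only as needed
MAX_CLUSTER_COUNT = 4 -- caps the credit cost of scale-out
SCALING_POLICY = 'STANDARD' -- start new clusters as soon as queries queue
AUTO_SUSPEND = 300 -- suspend after 5 minutes of inactivity
AUTO_RESUME = TRUE;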

Example
To demonstrate the scale-out capability of Snowflake, the same dashboard as above was used—this
time against the TPCH-SF1 sample data schema. It was published to Tableau Server, and then, using
TabJolt, a series of test runs were performed, running the above dashboard against different Snowflake
warehouse configurations:

WAREHOUSE SIZE   # WAREHOUSE CLUSTERS   TOTAL # SERVERS

XS               1, 2, 3, 4             1, 2, 3, 4
S                1, 2, 3, 4             2, 4, 6, 8
M                1                      4

All tests were run for five minutes with 20 concurrent threads (simulated users) using the
InteractVizLoadTest script, and the total number of completed tests and average test response times
were recorded. The results were as follows:



The chart shows that as we increase the size of the warehouse (scaling up in a single cluster from XS
-> S -> M) the total number of tests completed increases by 81% and the average test response time
decreases by 44%. This is unsurprising as we have increased the number of servers running queries
from one to two to four. Check out the following chart on the left:

However, when we keep the number of servers constant by running four XS clusters, two S clusters or
a single M cluster (each configuration totaling four servers), we see that the multi-cluster warehouse
approach has significantly better scalability characteristics—55% more completed tests and a 36% lower
average response time. Check out the above chart on the right.



Caching
The combination of Tableau and Snowflake has caching at multiple levels. The more effective we can
make our caching, the more efficient our environment will be. Queries and workbooks will return
results faster and less load will be applied to the server layers, leading to greater scalability.

TABLEAU CACHING
Tableau has caching in both Tableau Desktop and Tableau Server which can significantly reduce the
amount of rendering, calculation and querying needed to display views to users. For Tableau Server in
particular, caching is critical for achieving concurrent user scalability.

PRESENTATION LAYER
Caching at the presentation layer is relevant for Tableau Server only. In Tableau Server, multiple end
users view and interact with views via a browser or mobile device. Depending on the capability of the
browser and the complexity of the view, rendering will be done either on the server or on the client.

BROWSER
In the case of client-side rendering, the client browser downloads a JavaScript “viz client” that can
render a Tableau viz. It then requests the initial data package for the view from the server—called the
bootstrap response. This package includes the layout of the view and the data for the marks to be
displayed. Together, they are called the view model.

With this data, the viz client can then draw the view locally in the client’s browser.

As the user explores the view, simple interactions (e.g., tooltips, highlighting and selecting marks) can
be handled locally by the viz client using the local data cache. This means no communication with
the server is required and the result is a fast-screen update. This also reduces the compute load on
the server which helps with Tableau Server scalability. More complex interactions (e.g., changing a
parameter or filter) cannot be handled locally, so the update is sent to the server and the new view
model is sent to update the local data cache.
[Diagram: client-side rendering. The server handles query, filter, calcs, analytics, partition and layout; the browser-based viz client handles render, select, highlight, tooltip and display. The view model and subsequent updates are exchanged between them.]

Not all vizzes can be rendered client side. There are some complexities that will force Tableau to use
server-side rendering:
• If the user’s browser does not support the HTML 5 <canvas> element;

• If a view uses the polygon mark type or the page history feature; or

• If a view is determined to be overly complex. The complexity of the view is determined by the
number of marks, rows, columns and more.



TILE
In the case of server-side rendering, the view is rendered on the server as a set of static image “tiles.”
These tiles are sent to the browser, and a much simpler viz client assembles them into the final view.
The viz client monitors the session for user interaction (e.g., hovering for a tooltip, selecting marks,
highlighting, interacting with filters, etc.), and if anything needs to be drawn it sends a request back to
the server for new tiles. No data is stored on the client, so all interactions require a server round-trip.
[Diagram: server-side rendering. Tiles flow from the server to the browser and mouse events flow back; the server handles query, filter, calcs, analytics, partition, layout, render, select, highlight and tooltip, while the browser handles display only.]

To help with performance, the tile images are persisted on Tableau Server in the tile cache. In the event
the same tile is requested, it can be served from the cache instead of being re-rendered. Note that the
tile cache can be used only if the subsequent request is for the exact same view. This means requests
for the same view from different sized browser windows can result in different tiles being rendered.

One of the very easy performance optimizations you can make is to set your dashboards to fixed size
instead of automatic. This means the view requests will always be the same size, irrespective of the
browser window dimensions, allowing a better hit rate for the tile cache.

SESSION AND BOOTSTRAP RESPONSE


When rendering a viz, we need the view layout instructions and mark data—referred to as the view
model—before we can do anything. The VizQL Server generates the model based on several factors
including the requesting user credentials and the size of the viewing window. It computes the view of
the viz (i.e., relevant header and axis layouts, label positions, legends, filters, marks and such) and then
sends it to the renderer (either the browser for client-side rendering or the VizQL Server for server-
side rendering).

The VizQL Server on Tableau Server maintains a copy of view models for each user session. This is
initially the default view of the viz, but as users interact with the view in their browser, it is updated
to reflect the view state—highlights, filters, selections, etc. A dashboard can reference multiple view
models—one for each worksheet used. It includes the results from local calculations (e.g., table calcs,
reference lines, forecasts, clustering, etc.) and the visual layout (how many rows/columns to display for
small multiples and crosstabs, the number and interval of axis ticks/grid lines to draw, the number and
location of mark labels to be shown, etc.).

Generating view models can be computationally intensive, so we want to avoid this whenever
possible. The initial models for views—called bootstrap responses—are persisted on Tableau Server by
the Cache Server. When a request comes in for a view, we check the cache to see if the bootstrap
response already exists, and if it does, we serve it up without having to recompute and reserialize it.

Not all views can use the bootstrap response cache, though. Using the following capabilities will
prevent the bootstrap response from being cached:



• If a view uses relative date filters;

• If a view uses user-specific filters; or

• If a view uses published Data Server connections.

Like the tile cache, the bootstrap response is specific to the view size, so requests for the same
view from browser windows of different sizes will produce different bootstrap responses. As with the
tile cache, one of the easy performance optimizations you can make is to set your dashboards to a
fixed size instead of automatic. This means the view requests will always be the same size irrespective
of the browser window dimensions, allowing a better hit rate for the bootstrap response cache.

Maintaining the view model is important as it means the VizQL Server doesn't need to recompute the
view state every time the user interacts. The models are created and cached in memory by the VizQL
Server, which does its best to share them across user sessions where possible. The key factors in
whether a view model can be shared are:
• The size of the viz display area: Models can be shared only across sessions where the size of the viz
is the same. As already stated above, setting your dashboards to have a fixed size will benefit the
model cache in much the same way as it benefits the bootstrap response and tile caches, allowing
greater reuse and lowering the workload on the server.
• Whether the selections/filters match: If the user is changing filters, parameters, doing drill-down/
up, etc., the model is shared only with sessions that have a matching view state. Try not to publish
workbooks with “show selections” checked as this can reduce the likelihood that different sessions
will match.
• The credentials used to connect to the data source: If users are prompted for credentials to
connect to the data source, then the model can be shared only across user sessions with the same
credentials. Use this capability with caution as it can reduce the effectiveness of the model cache.
• Whether user filtering is used: If the workbook contains user filters or has calculations containing
functions such as USERNAME() or ISMEMBEROF(), then the model is not shared with any other
user sessions. Use these functions with caution as they can significantly reduce the effectiveness of
the model cache.

The session models are not persistent and expire after 30 mins of inactivity (default setting—this
can be changed) in order to recycle memory on the server. If you find that your sessions are expiring
before you are truly finished with a view, consider increasing the VizQL session timeout setting.

ANALYTICS LAYER
The analytics layer includes the data manipulations and calculations performed on the underlying data.
The objective of caching at this level is to avoid unnecessary computations and to reduce the number
of queries sent to the underlying data sources.

The caching in this layer applies to both Tableau Desktop and Tableau Server; however, there are
differences in the caching mechanisms between the two tools. These differences are in line with the
difference between a single user (Desktop) and multi-user (Server) environment.



ABSTRACT QUERY
When a user interacts with a view, many physical queries may be generated. Rather than blindly
executing them all, Tableau groups them into a query batch and decompiles them to see if there are
any optimizations that can be achieved. This could involve removing duplicate queries, combining
multiple similar queries into a single statement or even eliminating queries where the results of one
can be derived from the results of another. Tableau maintains a cache of query results indexed by the
logical structure of the query, and this is checked to see if it is necessary to proceed further to the
native query.

Tableau incorporates logic to not cache the results of:


• queries that return large result sets (too big for the cache);

• queries that execute quickly (faster to run the query than check the cache);

• queries that use user filters; or

• queries that use relative date filters.

DATA LAYER
The data layer addresses the native connection between Tableau and the data sources. Caching at this
level is about persisting the results of queries for reuse in future. It also determines the nature of the
connection to the data source – whether we are using live connections or the Tableau data engine
(replaced with Hyper in Tableau 10.5 and later).

NATIVE QUERY
The native query cache is similar to the abstract query cache, but instead of being indexed by the
logical query structure it is keyed by the actual query statement. Multiple abstract queries can resolve
to the same native query.

Like the abstract query cache, Tableau incorporates logic to not cache the results of:
• queries that return large result sets (too big for the cache);

• queries that execute quickly (faster to run the query than check the cache);

• queries that have user filters; or

• queries that use relative date filters.

SNOWFLAKE CACHING

RESULT CACHING
When a query is executed, the result is persisted in Snowflake for a period of time (currently 24
hours), at the end of which the result is purged from the system.

A persisted result is available for reuse by another query, as long as the user executing the query has
the necessary access privileges for the table(s) used in the query and all of the following conditions
have been met:



• The new query syntactically matches the previously-executed query.

• The table data contributing to the query result has not changed.

• The persisted result for the previous query is still available.

• Any configuration options that affect how the result was produced have not changed.

• The query does not include functions that must be evaluated at execution time
(e.g., CURRENT_TIMESTAMP).

Result reuse can substantially reduce query time because Snowflake bypasses query execution and,
instead, retrieves the result directly from the cache. Each time the persisted result for a query is
reused, Snowflake resets the 24-hour retention period for the result, up to a maximum of 31 days
from the date and time that the query was first executed. After 31 days, the result is purged, and the
next time the query is submitted, a new result is returned and persisted.

Result reuse is controlled by the session parameter USE_CACHED_RESULT. By default, the parameter
is enabled but can be overridden at the account, user and session level if desired.
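
For example, when benchmarking warehouse sizes you may want to temporarily disable result reuse so that repeated test runs are not served from the result cache. A minimal sketch:

-- disable result reuse for the current session while benchmarking
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- re-enable it afterwards (the default)
ALTER SESSION SET USE_CACHED_RESULT = TRUE;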

Note that the result cache in Snowflake will contain very similar data as the native query cache in
Tableau; however, they operate independently.

WAREHOUSE CACHING
Each warehouse, when running, maintains a cache of table data accessed as queries are processed by
the warehouse. This enables improved performance for subsequent queries if they are able to read
from the cache instead of from the table(s) in the query. The size of the cache is determined by the
number of servers in the warehouse; i.e., the larger the warehouse (and, therefore, the number of
servers in the warehouse), the larger the cache.

This cache is dropped when the warehouse is suspended, which may result in slower initial
performance for some queries after the warehouse is resumed. As the resumed warehouse runs and
processes more queries, the cache is rebuilt, and queries that are able to take advantage of the cache
will experience improved performance.

Keep this in mind when deciding whether to suspend a warehouse or leave it running. In other words,
consider the trade-off between saving credits by suspending a warehouse versus maintaining the
cache of data from previous queries to help with performance.
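
One lever in this trade-off is the warehouse auto-suspend delay. The following sketch assumes a warehouse named tableau_wh; the values are illustrative only:

-- keep the warehouse (and its local data cache) warm for 10 minutes of idle time
ALTER WAREHOUSE tableau_wh SET AUTO_SUSPEND = 600;

-- or favour credit savings over cache warmth
ALTER WAREHOUSE tableau_wh SET AUTO_SUSPEND = 60;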



Other performance considerations
CONSTRAINTS
Constraints define integrity and consistency rules for data stored in tables. Snowflake provides support
for constraints as defined in the ANSI SQL standard, as well as some extensions for compatibility with
other databases, such as Oracle. Constraints are provided primarily for data modelling purposes and
compatibility with other databases, as well as to support client tools that utilize constraints.

It is recommended that you define constraints between tables you intend to use in Tableau. Tableau
uses constraints to perform join culling (join elimination), which can improve the performance of
generated queries. If you cannot define constraints, be sure to set “Assume Referential Integrity” in the
Tableau data source to allow the query generator to cull unneeded joins.
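
As a sketch of what this might look like on the TPCH tables referenced later in this paper (the constraint names are illustrative; note that Snowflake records primary and foreign key constraints for metadata purposes but does not enforce them, NOT NULL being the exception):

ALTER TABLE "TPCH1"."ORDERS"
  ADD CONSTRAINT pk_orders PRIMARY KEY ("O_ORDERKEY");

ALTER TABLE "TPCH1"."LINEITEM"
  ADD CONSTRAINT fk_lineitem_orders FOREIGN KEY ("L_ORDERKEY")
  REFERENCES "TPCH1"."ORDERS" ("O_ORDERKEY");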

EFFECT OF READ-ONLY PERMISSIONS


Having the ability to create temp tables is important for Tableau users; when available, they are used
in multiple situations (e.g., complex filters, actions, sets, etc.). When they are unavailable (e.g., if you
are working with a shared database via a read-only account), Tableau will try to use alternate query
structures, but these can be less efficient and in extreme cases can cause errors.

Example
See the following visualization—a scatter plot showing Order Total Price vs. Supply Cost for each
Customer Name in the TPCH-SF1 sample data set. We have lassoed a number of the points and
created a set which is then used on the color shelf to show IN/OUT:

When we create the set, Tableau creates a temp table (in the background) to hold the list of
dimensions specified in the set (in this case, the customer name):



CREATE LOCAL TEMPORARY TABLE "#Tableau_37_2_Filter" (
"X_C_NAME" VARCHAR(18) NOT NULL,
"X__Tableau_join_flag" BIGINT NOT NULL
) ON COMMIT PRESERVE ROWS

We then populate this temp table with the selected values:

INSERT INTO "#Tableau_37_2_Filter" ("X_C_NAME", "X__Tableau_join_flag")
VALUES (?, ?)

Finally, the following query is then executed to generate the required result set:

SELECT "Customers"."C_NAME" AS "C_NAME",


(CASE WHEN ((CASE WHEN (NOT ("Filter_1"."X__Tableau_join_flag" IS NULL)) THEN 1 ELSE 0 END) = 1) THEN 1
ELSE 0 END) AS "io:Set 1:nk",
SUM("Orders"."O_TOTALPRICE") AS "sum:O_TOTALPRICE:ok",
SUM("PARTSUPP"."PS_SUPPLYCOST") AS "sum:PS_SUPPLYCOST:ok"
FROM "TPCH1"."LINEITEM" "LineItem"
INNER JOIN "TPCH1"."ORDERS" "Orders" ON ("LineItem"."L_ORDERKEY" = "Orders"."O_ORDERKEY")
INNER JOIN "TPCH1"."CUSTOMER" "Customers" ON ("Orders"."O_CUSTKEY" = "Customers"."C_CUSTKEY")
INNER JOIN "TPCH1"."PART" "Parts" ON ("LineItem"."L_PARTKEY" = "Parts"."P_PARTKEY")
INNER JOIN "TPCH1"."PARTSUPP" "PARTSUPP" ON ("Parts"."P_PARTKEY" = "PARTSUPP"."PS_PARTKEY")
LEFT JOIN "#Tableau_37_2_Filter" "Filter_1" ON ("Customers"."C_NAME" = "Filter_1"."X_C_NAME")
GROUP BY 1,
2

If a user does not have the right to create temp tables in the target database or schema, Tableau will
create WHERE IN clauses in the query to specify the set members:

SELECT "Customers"."C_NAME" AS "C_NAME"


FROM "TPCH_SF1"."CUSTOMER" "Customers"
WHERE ("Customers"."C_NAME" IN ('Customer#000000388', 'Customer#000000412', 'Customer#000000679',
'Customer#000001522', 'Customer#000001948', 'Customer#000001993', 'Customer#000002266',
'Customer#000002548', 'Customer#000003019', 'Customer#000003433', 'Customer#000003451',

'Customer#000149362', 'Customer#000149548'))
GROUP BY 1
ORDER BY 1 ASC



However, this approach is subject to the limitations of the query parser, and you will encounter errors
if you try to create a set with > 16,384 marks:

This error could be overcome by converting the data connection from a live connection to an
extracted connection; however, this is a viable approach only if the data volumes are not massive.



Measuring performance
IN TABLEAU

PERFORMANCE RECORDER
The first place you should look for performance information is the Performance Recording feature of
Tableau Desktop and Server. You enable this feature under the Help menu:



Start performance recording, then open your workbook. Interact with it as if you were an end user, and
when you feel you have gathered enough data, go back in the Help menu and stop recording. Another
Tableau Desktop window will open at this point with the data captured:

You can now identify the actions in the workbook that take the most time; for example, in the above image,
the selected query takes 0.8 seconds to complete. Clicking on the bar shows the text of the query being
executed. As the output of the performance recorder is a Tableau Workbook, you can create additional
views to explore this information in other ways.

You should use this information to identify those sections of a workbook that are the best candidates
for review; i.e., where you get the best improvement for the time you spend. More information on
interpreting these recordings can be found by clicking the following link:

http://tabsoft.co/1RdF420

LOGS

DESKTOP LOGS
In Tableau, you can find detailed information on what Tableau is doing in the log files. The default
location for Desktop is C:\Users\<username>\Documents\My Tableau Repository\Logs\log.txt.
This file is quite verbose and is written as JSON encoded text, but if you search for “begin-query” or
“end-query” you can find the query string being passed to the data source. Looking at the “end-query”
log record will also show you the time the query took to run and the number of records that were
returned to Tableau:



{"ts":"2015-05-24T12:25:41.226","pid":6460,"tid":"1674","sev":"info","req":"-","sess":"-","site":"-","user":"-","k":"end—
query", "v":{"protocol":"4308fb0","cols":4,"query":"SELECT [DimProductCategory].[ProductCategoryName]
AS [none:ProductCategoryName:nk],\n [DimProductSubcategory].[ProductSubcategoryName] AS
[none:ProductSubcategoryName:nk],\n SUM(CAST(([FactSales].[ReturnQuantity]) as BIGINT)) AS
[sum:ReturnQuantity:ok],\n SUM([FactSales].[SalesAmount]) AS [sum:SalesAmount:ok]\nFROM [dbo].[FactSales]
[FactSales]\n INNER JOIN [dbo].[DimProduct] [DimProduct] ON ([FactSales].[ProductKey] = [DimProduct].
[ProductKey])\n INNER JOIN [dbo].[DimProductSubcategory] [DimProductSubcategory] ON ([DimProduct].
[ProductSubcategoryKey] = [DimProductSubcategory].[ProductSubcategoryKey])\n INNER JOIN [dbo].
[DimProductCategory] [DimProductCategory] ON ([DimProductSubcategory].[ProductCategoryKey] =
[DimProductCategory].[ProductCategoryKey])\nGROUP BY [DimProductCategory].[ProductCategoryName],\n
[DimProductSubcategory].[ProductSubcategoryName]","rows":32,"elapsed":0.951}}

Since Tableau 10.1, JSON files are a supported data source, so you can also open the log files with
Tableau for easier analysis. You can analyze the time it takes for each event, what queries are being
run and how much data they are returning:

Another useful tool for analyzing Tableau Desktop performance is the Tableau Log Viewer. This cross-
platform tool allows you to easily view, filter and search the records from Tableau's log file. A powerful
feature of TLV is that it supports live monitoring of Tableau Desktop, so you can see in real-time the
log entries created by your actions in Tableau.



This tool is made available as-is by Tableau and can be downloaded from GitHub:

http://bit.ly/2rQJ6cu

Additionally, you could load the JSON logs into Snowflake and take advantage of the VARIANT data
type to help with your analysis.
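
A minimal sketch of that approach is shown below. The table and stage names are hypothetical, and the field names follow the log record shown above:

CREATE OR REPLACE TABLE tableau_desktop_logs (record VARIANT);

COPY INTO tableau_desktop_logs
  FROM @my_log_stage/log.txt
  FILE_FORMAT = (TYPE = 'JSON');

SELECT record:ts::timestamp      AS event_time,
       record:k::string          AS event_type,
       record:v:elapsed::float   AS elapsed_seconds,
       record:v:rows::number     AS rows_returned,
       record:v:query::string    AS query_text
FROM tableau_desktop_logs
WHERE record:k::string IN ('begin-query', 'end-query');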

SERVER LOGS
If you are looking on Tableau Server, the logs are in C:\ProgramData\Tableau\Tableau Server\data\
tabsvc\<service name>\Logs. In most cases, you will be interested in the VizQL Server log files. If you
do not have console access to the Tableau Server machine, you can download a ZIP archive of the log
files from the Tableau Server status page:

Just like the Tableau Desktop log files, these are JSON formatted text files, so you can open them in
Tableau Desktop or can read them in their raw form. However, because the information about user
sessions and actions is spread across multiple files (corresponding to the various services of Tableau
Server), you may prefer to use a powerful tool called Logshark to analyze Server logs. Logshark is a
command-line utility that you can run against Tableau logs to generate a set of workbooks that provide
insights into system performance, content usage and error investigation. You can use Logshark to
visualize, investigate and solve issues with Tableau at your convenience.



You can find out more about Logshark at the following link:

http://tabsoft.co/2rQK8oS

And you can download it from GitHub:

http://bit.ly/2rQS0qn

Additionally, you could load the JSON logs into Snowflake and take advantage of the VARIANT data
type to help with your analysis.

SERVER PERFORMANCE VIEWS


Tableau Server comes with several views to help administrators monitor activity on Tableau Server.
The views are located in the Analysis table on the server’s Maintenance page. They can provide
valuable information on the performance of individual workbooks, helping you focus your attention on
those that are performing poorly:



More information on these views can be found at the following link:

http://tabsoft.co/1RjCCL2

Additionally, custom administrative views can be created by connecting to the PostgreSQL database
that makes up part of the Tableau repository. Instructions can be found here:

http://tabsoft.co/1RjCACR

TABMON
TabMon is an open-source cluster monitor for Tableau Server that allows you to collect performance
statistics over time. TabMon is community-supported, and the full source code is released under
the MIT open source license.

TabMon records system health and application metrics out of the box. It collects built-in metrics like
Windows Perfmon, Java Health and Java Mbean (JMX) counters on Tableau Server machines across a
network. You can use TabMon to monitor physical (CPU, RAM), network and hard-disk usage. You can
track cache-hit ratio, request latency, active sessions and much more. It displays the data in a clean,
unified structure, making it easy to visualize the data in Tableau Desktop.



TabMon gives you full control over which metrics to collect and which machines to monitor, no
scripting or coding required. All you need to know is the machine and the metric name. TabMon
can run both remotely and independently of your cluster. You can monitor, aggregate and analyze
the health of your cluster(s) from any computer on your network with almost no added load to your
production machines.

You can find more information on TabMon here:

http://bit.ly/1ULFelf

TABJOLT
While not exactly a performance monitoring tool, TabJolt is very useful to help you determine if your
problems are related to platform capacity issues. TabJolt is particularly useful for testing Snowflake
multi-cluster warehouse scenarios when you want to simulate a load generated by multiple users.

TabJolt is a “point-and-run” load and performance testing tool specifically designed to work easily with
Tableau Server. Unlike traditional load-testing tools, TabJolt can automatically drive load against your
Tableau Server without script development or maintenance. Because TabJolt is aware of Tableau’s
presentation model, it can automatically load visualizations and interpret possible interactions during
test execution.

This allows you to just point TabJolt to one or more workbooks on your server and automatically load
and execute interactions on the Tableau views. TabJolt also collects key metrics including average
response time, throughput and 95th percentile response time, and captures Windows performance
metrics for correlation.

Of course, even with TabJolt, users should have sufficient knowledge of Tableau Server’s architecture.
Treating Tableau Server as a black box for load testing is not recommended and will likely yield results
that aren’t in line with your expectations.

You can find more information on TabJolt here:

http://bit.ly/1ULFtgi



IN SNOWFLAKE
Snowflake has several built-in features that allow you to monitor performance of the queries
generated by Tableau and to link them back to specific workbooks, data connections and users.

SNOWFLAKE INFORMATION SCHEMA


The Snowflake Information Schema (aka “Data Dictionary”) is a set of system-defined views and
functions that provide extensive information about the objects created in your account. The
Snowflake Information Schema is based on the SQL-92 ANSI Standard Information Schema, but also
includes views and functions that are specific to Snowflake.
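
For example, a simple query against the TABLES view shows the row counts and sizes of the tables Tableau will be connecting to (the schema name below is from the TPCH sample data; substitute your own):

SELECT table_name, row_count, bytes
FROM INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'TPCH_SF1'
ORDER BY bytes DESC;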

SNOWFLAKE QUERY HISTORY


Within the Information Schema is a set of tables and table functions that can be used to retrieve
information on queries that have been run during the past 7 days.

https://docs.snowflake.net/manuals/sql-reference/functions/query_history.html

These queries can return detailed information on the execution timing and profile of queries,
which can then be used to determine if your warehouse sizes are appropriate. You can see information
on the start time, end time, query duration, # rows returned and the data volume scanned. You can
also see the amount of time queries spend queued vs. executing, which is useful information when
considering adding multi-cluster scaling to your warehouses.

A useful way to access this information is through the query_history table functions via Tableau:
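
For instance, the following hedged sketch (run in a Snowflake worksheet, or used as a custom SQL source in Tableau) returns recent queries with their timing breakdown; the function and column names come from the Snowflake documentation, and the RESULT_LIMIT value is illustrative:

SELECT query_id,
       query_text,
       warehouse_name,
       start_time,
       total_elapsed_time,       -- milliseconds
       compilation_time,
       execution_time,
       queued_overload_time,     -- time spent queued because the warehouse was busy
       rows_produced,
       bytes_scanned
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 1000))
ORDER BY start_time DESC;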



The query history for an account can also be accessed from the Snowflake web interface:

This page displays all queries executed in the last 14 days, including queries executed by you in
previous sessions and queries executed by other users. This view shows query duration broken down
by queueing, compilation and execution and is also useful for identifying queries that are resolved
through the result cache.



By clicking on the hyperlink in the Query ID column, you can drill through to more detailed statistics
for the selected query:

This is also where you have access to the query profile.

SNOWFLAKE QUERY PROFILE


Query Profile is a powerful tool for understanding the mechanics of queries. Implemented as a page in
the Snowflake web interface, it provides a graphical tree view, as well as detailed execution statistics,
for the major components of the processing plan for a query. It is designed to help troubleshoot issues
in SQL query expressions that commonly cause performance bottlenecks.

To access Query Profile:


• Go to the Worksheet or History page in the web interface.

• Click on the Query ID for a completed query.

• In the query details, click the Profile tab.



To help you analyze query performance, the detail panel provides two classes of profiling information:

EXECUTION TIME
Execution time provides information about “where the time was spent” during the processing of a
query. Time spent can be broken down into the following categories, displayed in the following order:
• Processing—time spent on data processing by the CPU

• Local Disk IO—time when the processing was blocked by local disk access

• Remote Disk IO—time when the processing was blocked by remote disk access

• Network Communication—time when the processing was waiting for the network data transfer

• Synchronization—various synchronization activities between participating processes

• Initialization—time spent setting up the query processing

STATISTICS
A major source of information provided in the detail panel consists of various statistics, grouped in the
following sections:
• IO—information about the input-output operations performed during the query:

–– SCAN PROGRESS—the percentage of data scanned for a given table so far


–– BYTES SCANNED—the number of bytes scanned so far
–– PERCENTAGE SCANNED FROM CACHE—the percentage of data scanned from the local disk cache
–– BYTES WRITTEN—bytes written (e.g., when loading into a table)
–– BYTES WRITTEN TO RESULT—bytes written to a result object
–– BYTES READ FROM RESULT—bytes read from a result object
–– EXTERNAL BYTES SCANNED—bytes read from an external object (e.g., a stage)



• DML—statistics for Data Manipulation Language (DML) queries:

–– NUMBER OF ROWS INSERTED—number of rows inserted into a table (or tables)


–– NUMBER OF ROWS UPDATED—number of rows updated in a table
–– NUMBER OF ROWS DELETED—number of rows deleted from a table
–– NUMBER OF ROWS UNLOADED—number of rows unloaded during data export
–– NUMBER OF BYTES DELETED—number of bytes deleted from a table

• Pruning—information on the effects of table pruning:

–– PARTITIONS SCANNED—number of partitions scanned so far


–– PARTITIONS TOTAL—total number of partitions in a given table

• Spilling—information about disk usage for operations where intermediate results do not fit in memory:

–– BYTES SPILLED TO LOCAL STORAGE—volume of data spilled to local disk


–– BYTES SPILLED TO REMOTE STORAGE—volume of data spilled to remote disk

• Network—network communication:

–– BYTES SENT OVER THE NETWORK—amount of data sent over the network

This information can help you identify common problems that occur with SQL queries:

“EXPLODING” JOINS
Two common mistakes users make are joining tables without providing a join condition (resulting in
a “Cartesian product”) and getting the join condition wrong (resulting in records from one table
matching multiple records from another table). For such queries, the Join operator produces
significantly (often by orders of magnitude) more tuples than it consumes.

This can be observed by looking at the number of records produced by a Join operator, and it is
typically also reflected in the Join operator consuming a lot of time.

The following example shows input in the hundreds of records but output in the hundreds of thousands:

select tt1.c1, tt1.c2
from tt1
join tt2 on tt1.c1 = tt2.c1
  and tt1.c2 = tt2.c2;



The recommended action for this problem is to review the join conditions defined in the Tableau data
connection screen and correct any omissions.
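
A quick, hedged way to check for this situation (using the tt1/tt2 tables from the example above) is to look for duplicated join keys on one side of the join:

-- if this returns rows, the join key is not unique in tt2
-- and each matching row in tt1 will be multiplied accordingly
SELECT c1, c2, COUNT(*) AS duplicate_count
FROM tt2
GROUP BY c1, c2
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC;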

QUERIES TOO LARGE TO FIT IN MEMORY


For some operations (e.g., duplicate elimination for a huge data set), the amount of memory available
for the servers used to execute the operation might not be sufficient to hold intermediate results. As
a result, the query processing engine will start spilling the data to a local disk. If the local disk space is
insufficient, the spilled data is then saved to remote disks.

This spilling can have a profound effect on query performance (especially if data is spilled to a remote
disk). To alleviate this, we recommend:
• Using a larger warehouse (effectively increasing the available memory/local disk space for the
operation; see the sketch after this list), and/or
• Processing data in smaller batches
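
As a sketch of the first option, a running warehouse can be resized on the fly and the new size applies to subsequent queries (the warehouse name is hypothetical):

-- give memory-intensive queries more memory and local disk per cluster
ALTER WAREHOUSE tableau_wh SET WAREHOUSE_SIZE = 'LARGE';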

INEFFICIENT PRUNING
Snowflake collects rich statistics on data allowing it to avoid reading unneeded parts of a table based
on the query filters. However, for this to have an effect, the data storage order needs to be correlated
with the query filter attributes.

The efficiency of pruning can be observed by comparing Partitions scanned and Partitions total
statistics in the TableScan operators. If the former is a small fraction of the latter, pruning is efficient. If
not, the pruning did not have an effect.

Of course, pruning can help only for queries that actually filter out a significant amount of data.
If the pruning statistics do not show data reduction, but there is a Filter operator above TableScan that
filters out a significant number of records, this may signal that a different data organization would be
beneficial for this query.
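
If pruning looks inefficient, two hedged checks are sketched below: the clustering information function reports how well a table's partitions are ordered on a candidate filter column, and defining a clustering key changes the storage order going forward (table and column names are from the TPCH sample schema; note that maintaining a clustering key consumes credits):

-- how well is LINEITEM ordered with respect to a commonly filtered column?
SELECT SYSTEM$CLUSTERING_INFORMATION('TPCH1.LINEITEM', '(L_SHIPDATE)');

-- optionally ask Snowflake to maintain that ordering
ALTER TABLE TPCH1.LINEITEM CLUSTER BY (L_SHIPDATE);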

LINKING PERFORMANCE DATA BETWEEN TABLEAU AND SNOWFLAKE


One of the challenges users will face when examining performance is connecting queries that run in
Snowflake with the workbooks and user sessions in Tableau that generated them. One useful way to
do this is to set the QUERY_TAG session variable using an initial SQL statement.



Tableau provides parameters that can be inserted into initial SQL statements that will pass through
attributes from the Tableau environment into Snowflake:

• TableauServerUser: The user name of the current server user. Use when setting up impersonation on the server. Returns an empty string if the user is not signed in to Tableau Server. Example of returned value: asmith

• TableauServerUserFull: The user name and domain of the current server user. Use when setting up impersonation on the server. Returns an empty string if the user is not signed in to Tableau Server. Example of returned value: domain.lan\asmith

• TableauApp: The name of the Tableau application. Example of returned value: Tableau Desktop Professional or Tableau Server

• TableauVersion: The version of the Tableau application. Example of returned value: 9.3

• WorkbookName: The name of the Tableau workbook. Use only in workbooks with an embedded data source. Example of returned value: Financial-Analysis

If you use the TableauServerUser or TableauServerUserFull parameter in an initial SQL statement,
you will create a dedicated connection in Tableau that can’t be shared with other users. This will also
restrict cache sharing, which can enhance security but may slow performance.

For more detailed information, see the Tableau documentation here:

http://onlinehelp.tableau.com/current/pro/desktop/en-us/connect_basic_initialsql.html

Example
Using an initial SQL block, the following statement is used to set the QUERY_TAG session variable:

ALTER SESSION SET QUERY_TAG = [WorkbookName][TableauApp][TableauServerUser];

The resulting value is then available from the Query History table function and can be used to
attribute queries to specific workbooks and users:
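
A sketch of that attribution, grouping recent queries by their tag (the tag format follows the initial SQL example above, and the filter value is hypothetical):

SELECT query_tag,
       COUNT(*)                        AS query_count,
       AVG(total_elapsed_time) / 1000  AS avg_elapsed_seconds
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 10000))
WHERE query_tag LIKE 'Financial-Analysis%'
GROUP BY query_tag
ORDER BY query_count DESC;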



Find out more
Snowflake is the only data warehouse built for the cloud, enabling the data-driven enterprise with instant
elasticity, secure data sharing and per-second pricing. Snowflake combines the power of data warehousing,
the flexibility of big data platforms and the elasticity of the cloud at a fraction of the cost of traditional
solutions. Snowflake: Your data, no limits. Find out more at snowflake.net.
