The key considerations and criteria when choosing a happy home for your data. Getting a cloud-based data warehouse is a crucial step in setting up your stack.
Aug 4, 2022 - 5 min read
A data warehouse, intuitively, is a “shelter” that houses and stores all your data in one place. Simple concept to grasp, but with all the options available these days and the important role that it plays in business intelligence, it can be difficult to figure out which one makes the most sense for your business. It’s one of those topics that can send you spiraling down a rabbit hole trying to collect opinions and answers. Not only is it time-consuming to come to a conclusion, reversing the decision is also likely complicated. Through this article, we hope to help you out with your decision, as it’s truly a crucial step in getting your data stack set-up and for your business leaders to start deriving value from your data.
While data warehouses have been around for decades in enterprise analytics, clouddata warehousing has changed the game in recent years - turning the tide on on-premise-only solutions and offering something more accessible that can scale with today’s massive surge in business data. With cloud data warehousing, you can say goodbye to the headache and constraints of managing a physical data center. We believe this is the way to go to reduce your investment in hardware, and to ensure that your data warehouse can scale up and down as your business evolves. RedShift, BigQuery, and Snowflake are a few of the main players in this realm.
Why do you need a data warehouse?
Let’s start with the big picture purpose of why it’s a good idea to get a data warehouse in the first place.
It’s a central, single source-of-truth into which data is collected from scattered sources.
This is done through an ETL process, during which data is usually converted from different formats into one standard format, which makes it highly consistent and easy to use. All business functions (sales, marketing, and more) can pull from the same resource.
With a central repository that encompasses every type of data, you have full flexibility over how you store and query your data. When it’s all there in one place, you can run queries directly in your warehouse or through a Business Intelligence tool like Whaly. There’s no need for business teams to log into disparate systems to get the answers they need - they can access it from one interface which makes data much more accessible.
A data warehouse ensures that data is up to date and accurate, leading to high integrity.
Last but not least, data warehouses use a cleaningprocess to eliminate poor-quality information. Duplicate, corrupted, or inaccurate data gets removed which means you’re left with a shiny, beautiful home for your data to live in.
The evaluation process:
Before you begin your evaluation, establish clear uses cases. What type of data do you need? How will you use it and what will you do with it? Let your use cases be your guiding light during the search process. A few examples of use cases are: analyzing and optimizing sales processes, tracking and improving company revenue and growth, improving relationships with customers and reducing churn, and analyzing operations and logistics - just to name a few.
Considerations & criteria:
In your evaluation process, make sure you consider these different angles to cover all your bases:
Data types: what type of data do you want your warehouse to store? Is the data structured? Redshift, BigQuery, and Snowflake work with structured data.
Performance & speed: this refers to your requirements for query response time, report response time, and time to upload, update, or refresh the data warehouse.
Pricing and cost transparency: Your budget and the pricing structure of the solution are key considerations, particularly as they differ quite a lot. Will it cost money now, or in the future? How will their pricing evolve? Redshift, for example, charges by the hour based on the type and number of nodes in your cluster, or by the number of bytes scanned, while BigQuery charges for each query. They also charge for storage, which means the bill will only go up as you accumulate more data. This is where BigQuery can be a little less predictable - it may seem cheaper in the short-term, but can get expensive in the long-term. Pricing also relates to scalability. Snowflake is based on storage and compute time, and scales nicely because you can separate projects and control costs of each, which isn’t possible with BigQuery. The pricing models can be complex, so make sure you get a full understanding before you make the call.
Security & Compliance: what are the regulations you need to adhere to, and is the solution compliant? It’s important to be assured that the data will not leak to malicious third parties. Redshift, BigQuery, and Snowflake all take security seriously and have these features built-in.
Availability: this should be a top concern for any data-driven company. Reliance on data is increasing exponentially with a higher demand for real-time analytics for quicker and smarter decision-making. Your data warehouse should offer a high level of availability and access across teams and functions.
Volume & Scalability: roughly how much data do you need to store now, and how much will that grow in the future? This question is usually hard to answer, especially if you’re a fast-growing business. It’s always best to go with one that can infinitely scale with your business in an easy, elastic way, as opposed to running into storage and other issues down the road. Snowflake automatically scales to keep queries fast, and BigQuery automatically resizes your warehouse without storage limits. Redshift can scale horizontally by manually adding a few nodes.
Maintenance: how much engineering effort is required to maintain the warehouse? Do you have the resources in place? BigQuery and Snowflake are fully managed, for example, while Redshift requires some manual maintenance. Redshift is generally harder to set-up and manage, which can take more time and resources.
Language: SQL vs. python. SQL is a programming language that is designed for managing data held in a relational database management system. An SQL-first warehouse tends to be more compatible with business intelligence platforms that can simplify your process for analyzing and visualizing the data once it’s in the warehouse.
The APIs / SDKs:how easy it is to load into the data warehouse and play with it afterwards? Does it connect seamlessly and effectively to BI, ETL, SDKs, modeling tools, etc.?
Cloud integration: how well does it integrate with your cloud? How easily can you build bridges between your existing cloud services and this data warehouse? Can you reuse all the access controls that you deployed on your cloud? Snowflake can be deployed on any cloud.
In the cloud-based data warehouse world, Redshift, BigQuery, and Snowflake are similar in that they are scalable, flexible solutions that provide more elasticity and cost savings than traditional on-premise solutions. The considerations above should help you evaluate from different angles and not miss out on any key criteria. How the services are billed varies quite a lot, so look at your workload, your volume of data, and how many queries you’ll likely run - but note that this will likely increase in the future, so scalability is key. Whether or not you have engineers and resources on hand for maintenance is also important to consider. Lastly, if you have a BI tool in mind, you can check whether they connect natively with your data warehouse.