8 key things to consider when choosing your ETL

Most of your internal data is trapped in your SaaS tools, making it impossible to query or combine.

You need a dedicated "data extractor" tool that can pull your data out of your favorite tools and put it in one place (generally a database). Those extractors are called "ETL" (Extract-Transform-Load) or "ELT" (Extract-Load-Transform).

Note: The difference between ETL and ELT is explained at length elsewhere on the internet and is irrelevant to this article, so we'll call all data extractor platforms ETL.
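To make the Extract and Load steps concrete, here is a minimal sketch of what any such tool does under the hood. The `fetch_contacts` function and its fields are hypothetical stand-ins for a real SaaS API call, and SQLite stands in for your warehouse:

```python
import sqlite3

# Toy "Extract" step: in a real ETL this would call the SaaS tool's API.
# fetch_contacts and its fields are hypothetical, for illustration only.
def fetch_contacts():
    return [
        {"id": 1, "email": "ada@example.com"},
        {"id": 2, "email": "grace@example.com"},
    ]

# "Load" step: write the extracted records into one central database.
def load_contacts(conn, records):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contacts (id INTEGER PRIMARY KEY, email TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO contacts (id, email) VALUES (:id, :email)",
        records,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load_contacts(conn, fetch_contacts())
rows = conn.execute("SELECT COUNT(*) FROM contacts").fetchone()[0]
```

Everything an ETL platform adds on top of this loop (scheduling, incremental sync, schema handling, monitoring) is what the criteria below evaluate.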

While building Whaly 🐳, we tested numerous ETL tools for our customers and formed our own opinions on what an ETL should look like. Here is a review of the criteria we recommend for choosing your ETL, based on our own experience.

๐Ÿ‡ Fast

To trust your data analysis and dashboards, you need fresh data at your disposal. That requires an ETL that runs often and fast. Data should arrive in seconds or minutes, not hours ⏳

Top players:

  • Fivetran: they sync only data "updates" (no full copy), and they do it fast, with optimized Java.

Second-rate players:

  • Singer / Stitch: the worst on speed. Connectors often do a "full extract" of the data at each run, in unoptimized synchronous Python.
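The "updates only" approach that makes a sync fast is usually a cursor over a last-modified timestamp: keep the high-water mark from the previous run and extract only rows changed since. A minimal sketch, assuming a source that exposes an `updated_at` field (the sample records are invented):

```python
# Hypothetical source records carrying an updated_at timestamp.
SOURCE = [
    {"id": 1, "name": "alpha", "updated_at": "2024-01-01T00:00:00"},
    {"id": 2, "name": "beta",  "updated_at": "2024-03-01T00:00:00"},
    {"id": 3, "name": "gamma", "updated_at": "2024-06-01T00:00:00"},
]

def incremental_extract(records, cursor):
    """Return only rows changed since the last run, plus the new cursor."""
    fresh = [r for r in records if r["updated_at"] > cursor]
    new_cursor = max((r["updated_at"] for r in fresh), default=cursor)
    return fresh, new_cursor

# First incremental run after a full sync completed in February:
changed, cursor = incremental_extract(SOURCE, "2024-02-01T00:00:00")
```

A full extract would reprocess all three rows every run; the cursor version touches only the two that actually changed, which is where the speed difference comes from.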

💸 Cheap

If moving data is expensive, you will probably stop doing it, or be really selective about what you extract, blinding yourself in the process to all the insights you could get.

So, your ETL solution should be in the $10 to $100 range per month per data source, not more.

Top players:

  • Singer, Meltano, Airbyte: self-hosted solutions, so you pay the lowest possible price.

Second-rate players:

  • Fivetran is the worst here: it can easily bill you $200 to $1,000 per source per month (for an average CRM of 10k+ contacts).

🔗 Relational

The data you move always has relations. A campaign is linked to a creative and a spend table, a contact is linked to a company, etc.

The information about how your extracted tables relate to each other lives in the data source, so it should be extracted along with the data.
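What "extracting the relations" means in practice is loading the source's links as foreign keys, so joins work immediately in the warehouse. A sketch with an invented two-table CRM extract, using SQLite as the warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical CRM extract: the contact -> company relation from the
# source is kept as a foreign key instead of two disconnected tables.
conn.executescript("""
CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE contact (
    id INTEGER PRIMARY KEY,
    email TEXT,
    company_id INTEGER REFERENCES company(id)
);
INSERT INTO company VALUES (1, 'Acme');
INSERT INTO contact VALUES (10, 'ada@acme.com', 1);
""")

# With the relation preserved, joins work out of the box:
row = conn.execute("""
    SELECT contact.email, company.name
    FROM contact JOIN company ON contact.company_id = company.id
""").fetchone()
```

Without the schema, an analyst has to rediscover that `company_id` points at `company.id` by reading API docs or guessing from column names.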

Top players:

  • Fivetran: offers an entity-relationship diagram (ERD) with some relationships in its documentation. It's better than nothing.

Second-rate players:

  • All the other ETL tools: none of them fetch the relationship schemas, so you will need to rebuild them yourself.

🧹 Clean data model

Some API endpoints are a mess, and some normalization needs to happen, because nobody wants to analyze data straight from the raw "API result".

A first normalization step is needed to convert timestamps, split data into tables where it makes sense, and remove invalid records, so that data analysts can keep their sanity and stay productive.

Top players:

  • Fivetran: delivers a proper data model for each tool.

Second-rate players:

  • Singer / Stitch / Airbyte: these connectors are terrible at delivering properly modeled data; they dump the API results into tables and call it a day. 😡

🚮 Hard delete

When something is deleted in a source system, it should be deleted or flagged in the data warehouse.

Weirdly, most connectors out there don't go the extra mile needed to make this a reality.
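One common way to catch hard deletes is the scheduled full copy mentioned below: compare the primary keys in a fresh source snapshot against what the warehouse holds, and flag anything that disappeared. A sketch with invented records and a `_deleted` soft-delete flag:

```python
# Warehouse state after previous syncs (invented records).
warehouse = {
    1: {"email": "ada@example.com", "_deleted": False},
    2: {"email": "grace@example.com", "_deleted": False},
}

def apply_snapshot(warehouse, source_ids):
    """Flag warehouse rows whose primary key no longer exists in the source."""
    for pk, row in warehouse.items():
        if pk not in source_ids:
            row["_deleted"] = True
    return warehouse

# Contact 2 was hard-deleted in the source system, so the latest
# full snapshot only contains id 1:
apply_snapshot(warehouse, source_ids={1})
```

Flagging rather than deleting keeps history queryable while letting dashboards filter out records that no longer exist in the source.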

Top players:

  • Fivetran: makes the extra effort of dealing with deletes (webhooks, scheduled full copies).

Second-rate players:

  • Stitch / Singer: the connectors often don't care about propagating deletes... 🤔

🆕 Reactive to schema changes

Data is the lifeblood of business. When the business changes, data changes accordingly.

This means attributes are created, updated, or deleted every day or week, and the same goes for objects.

A good ETL tool makes managing those changes a breeze and reacts to them properly.
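The "on the fly" column creation described below can be sketched as: inspect the destination table, diff it against the incoming record, and add any missing columns before loading. A minimal version against SQLite (table and field names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (id INTEGER PRIMARY KEY, email TEXT)")

def sync_schema(conn, table, record):
    """Add any column present in the record but missing from the table."""
    # PRAGMA table_info rows are (cid, name, type, ...); index 1 is the name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for col in record:
        if col not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {col} TEXT")

# The source just published a new "phone" attribute on contacts:
sync_schema(conn, "contact", {"id": 3, "email": "x@y.z", "phone": "555"})
cols = [row[1] for row in conn.execute("PRAGMA table_info(contact)")]
```

Tools that instead require manual reconfiguration turn every new source attribute into a ticket for the data team.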

Top players:

  • Fivetran: by creating and updating columns and tables on the fly when a new schema is published, Fivetran helps users cope with the changes.

Second-rate players:

  • Stitch / Singer: by requiring manual action and configuration to update a schema, these tools make it hard to evolve with the data.

👋 Extensible and open

Sometimes, you will need to tweak the code of your ETL to add a specific property or improve a connector in a way that only suits your needs. Having access to the source code and being able to edit it can therefore come in very handy.

Sadly, not all ETLs share their connector source code or let end users contribute or run their own modified versions of a connector.

Top players:

  • Stitch / Singer / Airbyte: by keeping all their connectors open source, these tools let end users contribute.

Second-rate players:

  • Fivetran: by keeping its connectors closed source, Fivetran prevents the end-user community from adapting connectors to their needs or helping develop them.

👀 Observable

Good monitoring of what was synced, what just failed, why, and how is paramount for any data pipeline such as an ETL.

So a proper ETL platform should be tightly integrated with monitoring tools to help Ops manage and react to any weird behavior.
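The simplest form of this integration is emitting one structured log line per sync run, which monitoring tools can then parse into metrics and alerts. A sketch (the event shape and field names are invented, and an in-memory buffer stands in for a real log sink):

```python
import io
import json
import logging

# In-memory buffer standing in for a real log sink (stdout, Stackdriver, ...).
buf = io.StringIO()
logging.basicConfig(stream=buf, level=logging.INFO,
                    format="%(message)s", force=True)
log = logging.getLogger("etl")

def report_run(source, rows_synced, failed, error=None):
    """Emit one machine-parseable event per sync run."""
    log.info(json.dumps({
        "event": "sync_finished",
        "source": source,
        "rows_synced": rows_synced,
        "failed": failed,
        "error": error,
    }))

report_run("crm", rows_synced=1200, failed=False)
record = json.loads(buf.getvalue().strip())
```

Because each run is a structured event rather than free-form text, Ops can alert on `failed` or graph `rows_synced` without writing fragile log parsers.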

Top players:

  • Fivetran: by integrating with cloud monitoring tools (Google Stackdriver, etc.), Fivetran lets people build their own metrics and alerting on top of their cloud tooling.

Conclusion

As you can see, no ETL platform checks every box; each has limitations in one domain or another.

Fivetran will be more costly but will save you time on the data/engineering side, because your team will have less preparation and transformation work to do.

Stitch and the other players are cheaper, but your data/engineering team will spend a lot of time transforming the data once loaded.

At Whaly ๐Ÿณ, as those are the characteristics where we can't do any compromises (because our users won't ๐Ÿ˜ƒ), we had to build our own connectors that check all of the points above ๐Ÿ’ช