![]() ![]() We followed Jostein Leira’s excellent Airflow setup guide to get the Airflow server up and running. We hosted the Airflow server on a virtual machine running in a VPC within Google Cloud Platform. Your dbt transformation process could evolve substantially while keeping this same architecture in place. Our use case is to use dbt to aggregate the average HP score across pokemon catches. We used a publicly available dataset on Pokemon for this exercise, saved in a Google Sheet. ![]() Access to a Google Cloud Platform account to host the Airflow server.The dbt Cloud Account ID, which is used in the API calls to dbt Cloud.Snowflake database to accept data loads and run the transforms.Navigate to Profile > API Access from the dbt Cloud console to generate this. This link documents how to create this value yourself. These are encoded into an api token in base64. To implement this yourself, at a minimum you will need access to the following: To be clear, we didn’t seek to implement a complete production-ready system for this task, but rather to create a starting point. You can find the codebase we implemented in writing this article here. Leveraged dbt’s native ability to parallelize job runs.Minimal re-work on the Airflow side as the dbt transformation and Fivetran processes grow in complexity over time.Isolation between the data loading (Fivetran), transformation (dbt) and orchestration (Airflow) functions in the stack. ![]() These include the ability to run Pull Request checks with dbt Test independently of Airflow use. That said, there are some additional benefits to using dbt Cloud. Our friends at Astronomer have a great series of blog posts featuring orchestration of dbt Core with Airflow. Note that an organization using dbt Core could accomplish a very similar workflow using the Airflow Bash Operator to trigger a dbt run. Our goal is to present a simplified, linear workflow and illustrate how to coordinate tasks using the Fivetran and dbt Cloud APIs with Airflow XComs. In that case, event-driven architectures like the one we describe in this article are key. However, many of the larger organizations (2,000+ employees) that and I work with as members of the Solutions Architecture team at Fishtown view these as high-priority items. If these factors are not a high priority, then it’s completely valid to use scheduled processes. These might include supplying fresh data for a machine learning model to consume, or reporting SLA requirements.
0 Comments
Leave a Reply. |