Migrate from dbt-spark to dbt-databricks
Introduction
You can migrate your projects from using the dbt-spark adapter to using the dbt-databricks adapter. In collaboration with dbt Labs, Databricks built this adapter using dbt-spark as the foundation and added some critical improvements. With it, you get an easier setup, requiring only three inputs for authentication, and more features, such as support for Unity Catalog.
Prerequisites
- Your project must be compatible with dbt 1.0 or greater. Refer to Upgrading to v1.0 for details. For the latest version of dbt, refer to Upgrading to v1.7.
- For dbt Cloud, you need administrative (admin) privileges to migrate dbt projects.
Simpler authentication
Previously, you had to provide a cluster or endpoint ID, which was hard to parse from the http_path that you were given. Now, it doesn't matter whether you're using a cluster or a SQL endpoint because the dbt-databricks setup requires the same inputs for both. All you need to provide is:
- hostname of the Databricks workspace
- HTTP path of the Databricks SQL warehouse or cluster
- appropriate credentials
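For example, a minimal dbt-databricks target built from just those inputs might look like the following sketch (the profile name and values are placeholders; complete side-by-side examples appear later in Try these examples):
your_profile_name:
  target: dev
  outputs:
    dev:
      type: databricks
      host: your-workspace.cloud.databricks.com       # hostname of the Databricks workspace
      http_path: /sql/1.0/endpoints/your-endpoint-id  # HTTP path of the SQL warehouse or cluster
      schema: my_schema                               # target schema (not an authentication input)
      token: [my_secret_token]                        # appropriate credentials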
Better defaults
The dbt-databricks adapter provides better defaults than dbt-spark does. The defaults help optimize your workflow so you can get the fast performance and cost-effectiveness of Databricks. They are:
- The dbt models use the Delta table format. You can remove any declared configurations of file_format = 'delta' since they're now redundant.
- Expensive queries are accelerated by the Photon engine.
- The incremental_strategy config is set to merge.
With dbt-spark, however, the default for incremental_strategy is append. If you want to continue using incremental_strategy=append, you must set this config explicitly on your incremental models (see the sketch after this paragraph). If you already specified incremental_strategy=merge on your incremental models, you don't need to change anything when moving to dbt-databricks, but you can tidy your models by removing the config since it's now redundant. Read About incremental_strategy to learn more.
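A minimal sketch of how these defaults might be handled in dbt_project.yml, assuming a hypothetical project named my_project (the same configs can also be set on individual models):
models:
  my_project:
    # Redundant with dbt-databricks: Delta is already the default file format
    +file_format: delta
    # Only keep this if you want to retain dbt-spark's default of append
    +incremental_strategy: append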
For more information on defaults, see Caveats.
Pure Python
If you use dbt Core, you no longer have to download an independent driver to interact with Databricks. The connection information is all embedded in a pure-Python library called databricks-sql-connector.
Migrate your dbt projects in dbt Cloud
You can migrate your projects to the Databricks-specific adapter from the generic Apache Spark adapter. If you're using dbt Core, skip to Migrate dbt projects in dbt Core.
The migration to the dbt-databricks adapter from dbt-spark shouldn't cause any downtime for production jobs. dbt Labs recommends that you schedule the connection change when usage of the IDE is light to avoid disrupting your team.
To update your Databricks connection in dbt Cloud:
- Select Account Settings in the main navigation bar.
- On the Projects tab, find the project you want to migrate to the dbt-databricks adapter.
- Click the hyperlinked Connection for the project.
- Click Edit in the top right corner.
- Select Databricks for the warehouse.
- Enter the:
  - hostname
  - http_path
  - (optional) catalog name
- Click Save.
Everyone in your organization who uses dbt Cloud must refresh the IDE before starting work again. It should refresh in less than a minute.
Configure your credentials
When you update the Databricks connection in dbt Cloud, your team will not lose their credentials. This makes migrating easier since it only requires you to delete the Databricks connection and re-add the cluster or endpoint information.
These credentials will not get lost when there's a successful connection to Databricks using the dbt-spark ODBC method:
- The credentials you supplied to dbt Cloud to connect to your Databricks workspace.
- The personal access tokens your team added in their dbt Cloud profile so they can develop in the IDE for a given project.
- The access token you added for each deployment environment so dbt Cloud can connect to Databricks during production jobs.
Migrate dbt projects in dbt Core
To migrate your dbt Core projects to the dbt-databricks adapter from dbt-spark, you:
- Install the dbt-databricks adapter in your environment (for example, with pip install dbt-databricks)
- Update your Databricks connection by modifying your target in your ~/.dbt/profiles.yml file
Anyone who's using your project must also make these changes in their environment.
Try these examples
You can use the following examples of the profiles.yml file to compare the authentication setup with dbt-spark to the simpler setup with dbt-databricks when connecting to a SQL endpoint. A cluster example would look similar.
An example of what authentication looks like with dbt-spark:
your_profile_name:
target: dev
outputs:
dev:
type: spark
method: odbc
driver: '/opt/simba/spark/lib/64/libsparkodbc_sb64.so'
schema: my_schema
host: dbc-l33t-nwb.cloud.databricks.com
endpoint: 8657cad335ae63e3
token: [my_secret_token]
An example of how much simpler authentication is with dbt-databricks:
your_profile_name:
target: dev
outputs:
dev:
type: databricks
schema: my_schema
host: dbc-l33t-nwb.cloud.databricks.com
http_path: /sql/1.0/endpoints/8657cad335ae63e3
token: [my_secret_token]