Databricks
Databricks is a composable data lake and lakehouse architecture platform that runs on all major cloud vendor systems, enabling ML, AI, Data Warehousing, and Data Science capabilities from development through production lifecycles.
- Be sure to grantlist (whitelist) the DLH.io IP addresses for any cloud vendor, Databricks server, or network access required
- Create a Databricks personal access token following the documentation, https://docs.databricks.com/en/dev-tools/auth/pat.html, or configure OAuth 2.0 Machine-to-Machine (M2M) access using the documentation, https://docs.databricks.com/en/dev-tools/auth/oauth-m2m.html
- Once a type of authorization is selected in the DLH.io connection configuration, it cannot be changed for that connector, so choose the desired method before proceeding.
- For M2M OAuth be sure to capture your Client ID, Client Secret, and Workspace URL for your Databricks target.
- If using a user's personal access token, the following permissions must be granted:
- CREATE
- MODIFY
- READ_METADATA
- SELECT
- USAGE
- (Optional) If you specify an existing catalog in the connection form, that Catalog must grant the following permissions to the service principal/account:
- CREATE SCHEMA
- CREATE TABLE
- MODIFY
- SELECT
- USE CATALOG
- USE SCHEMA
- Have a Databricks SQL Warehouse or Compute (Cluster/All-Purpose) created:
- If not yet created, follow the Databricks documentation for creating a SQL Warehouse, https://docs.databricks.com/en/compute/sql-warehouse/create.html
- DLH.io does not currently support a direct Unity Catalog target destination, only SQL Warehouse target loading
- For OAuth permissions, assign at a minimum the CAN USE permission.
- Create a dedicated new user in Databricks for DLH.io (recommended) or for the M2M configuration, or confirm an existing user is available with the necessary permissions. To create a new user in Databricks:
- Log in to your Databricks account
- Open the workspace desired
- Create a new user and assign the appropriate permissions
- Create a PAT for the user following the instructions here, https://docs.databricks.com/en/dev-tools/auth/pat.html#databricks-personal-access-tokens-for-workspace-users, because personal access tokens are user-specific. If using a "Service Account" type of user, log in as that user and follow: Developer > Access Tokens > Click "Manage" > Generate New Token > Click "Generate".
- Set permissions in the default (or optionally selected) Catalog:
- Update the security permissions so the service principal/account or PAT user has access to the Catalog:
- Click on the Catalog, then click Permissions > Click the Grant button
- Complete the form and finalize it for the service principal/account or user by clicking the Grant button
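The catalog permissions listed in the prerequisites can also be granted in SQL rather than through the UI. Below is a minimal sketch that builds one GRANT statement per required privilege; the catalog name and service principal name are placeholder values, and you would run each statement in a Databricks SQL editor or over a SQL Warehouse connection:

```python
# Sketch: build GRANT statements for the catalog privileges listed above.
# "main" and "dlh-service-principal" are hypothetical placeholder values.
CATALOG = "main"
PRINCIPAL = "dlh-service-principal"

PRIVILEGES = [
    "CREATE SCHEMA",
    "CREATE TABLE",
    "MODIFY",
    "SELECT",
    "USE CATALOG",
    "USE SCHEMA",
]

def grant_statements(catalog: str, principal: str) -> list[str]:
    """Return one GRANT statement per required privilege."""
    return [
        f"GRANT {priv} ON CATALOG {catalog} TO `{principal}`"
        for priv in PRIVILEGES
    ]

for stmt in grant_statements(CATALOG, PRINCIPAL):
    print(stmt)
```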
Per the prerequisites, capture all the necessary details about your Databricks account and prepare to use them for creating your DLH.io Databricks connection.
In DLH.io, choose Targets from the header or sidebar. Search for, or directly locate, and select Databricks from the list of connectors. A new target connection form will appear for you to enter details about your account/workspace that will be used for the data integration process.
- On the DLH.io Connection Form :: Enter your Connection Information
- Enter in the Name/Alias field, the name you'll use within DLH.io to differentiate this connection from others
- Enter in the Server Hostname field, the name of the endpoint server:
- Use the endpoint as provided in your workspace, for example a Server Hostname similar to ddd-asdfasw3-as00.cloud.databricks.com, which is unique to your workspace.
- (Optional) Enter in the Catalog field, the name of the existing Unity Catalog you wish to load data into. By default, DLH.io will create a new Catalog and load your respective source data into it, so it is important to provide authentication credentials with the necessary permissions.
- The option to specify an existing Unity Catalog is available for Business Critical plans only at this time.
- The Unity Catalog feature must be enabled in the workspace for this option to work correctly
- Enter in the Port field, the port where this account is accessible and where the firewall/network rules allow DLH.io (see prerequisites for IP grantlisting). For this connector the default port is 443, which is standard, but it is listed here for future-proofing.
- Only change this if suggested by support, or if you have an advanced implementation or knowledge of your account's requirements
- Enter in the HTTP Path field, the HTTP Path provided in your workspace connection area for either your compute cluster or the SQL Warehouse.
- In the case of a compute cluster (or legacy), enter for example sql/protocolv1/o/<orgId>/<clusterId> or sql/1.0/endpoints/<clusterId>; for a SQL Warehouse, enter for example sql/1.0/warehouses/<warehouseId>
- Enter in the User or Client Id field, the username of the user account or service account you created (see prerequisites) that will authenticate your connection in DLH.io:
- If using the Personal Access Token or Username/Password (community edition only) authentication option, this field will usually be the user's email address
- If using the OAuth authentication type (i.e., M2M, etc.), this field will be shown as the Client ID field
- Select Auth Type required from the dropdown.
- Use the Personal Access Token (PAT) option for legacy account environments, or for quick testing and development when you do not have the administrative permissions to create OAuth credentials. Databricks has mentioned that this option may be deprecated in the future.
- Use the OAuth 2.0 option (recommended) for current account environment versions and newer configurations, as personal access tokens may be deprecated in the near future
- Use the Username/Password option only for the Databricks community edition. This option may be removed at any time from DLH.io, so it is recommended to use one of the other authentication options.
- Based on your Auth Type selection, the correct credential field will appear.
- Enter in the Password field, your personal access token (PAT), or your password if using the Username/Password option, for the user with which you are connecting to your account
- If selecting OAuth 2.0, the Client ID and Client Secret fields will appear; enter the appropriate credentials and information in the correct fields.
- Click on Save & Test to save the connection and test that we can connect.
- If updating the form Click Save & Test or just Test
- Clicking on Save & Test will also save any changes, such as a password change. Any test of the connection will attempt to connect to your database/account with the credentials and info provided.
- A message of success or failure will be shown:
- If successful, you'll be presented with the schema objects of the database and will need to complete the final configuration steps shown below.
- If the test connection fails, the connection is still saved, but you will need to correct the failure based on the reason provided in the message
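The same form details can be used to verify connectivity outside DLH.io with the databricks-sql-connector package. Below is a minimal sketch; the hostname, HTTP Path, and token values are placeholders, the package import is deferred so the helper definitions load even where it is not installed, and nothing connects until `check_connection` is actually called:

```python
def normalize_hostname(value: str) -> str:
    """Strip any scheme and trailing slash so only the bare hostname remains."""
    return value.removeprefix("https://").removeprefix("http://").rstrip("/")

def check_connection(server_hostname: str, http_path: str, access_token: str) -> bool:
    """Attempt a trivial query against the workspace; True on success."""
    # Requires `pip install databricks-sql-connector`; imported lazily so this
    # sketch can be loaded without the package present.
    from databricks import sql

    with sql.connect(
        server_hostname=server_hostname,  # e.g. your *.cloud.databricks.com host
        http_path=http_path,              # e.g. sql/1.0/warehouses/<warehouseId>
        access_token=access_token,        # PAT from the prerequisites
    ) as conn:
        with conn.cursor() as cursor:
            cursor.execute("SELECT 1")
            (value,) = cursor.fetchone()
            return value == 1
```

A failed call here usually points at the same causes as a failed Save & Test: wrong hostname or HTTP Path, or insufficient permissions on the token's user.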
Before you can use OAuth to authenticate to Databricks, you must first create an OAuth secret, which can be used to generate OAuth access tokens. A service principal can have up to five OAuth secrets. Account admins and workspace admins can create an OAuth secret for a service principal.
The secret will only be revealed once during creation. The client ID is the same as the service principal’s application ID.
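For M2M OAuth, the Client ID and Client Secret are exchanged for a short-lived access token at the workspace's token endpoint, per the Databricks OAuth M2M documentation linked in the prerequisites. A sketch that assembles that request (the hostname is a placeholder; sending it is left as a comment):

```python
def build_token_request(server_hostname: str, client_id: str, client_secret: str) -> dict:
    """Assemble the OAuth 2.0 client-credentials token request pieces."""
    return {
        "url": f"https://{server_hostname}/oidc/v1/token",
        # HTTP Basic auth with the service principal's credentials
        "auth": (client_id, client_secret),
        "data": {"grant_type": "client_credentials", "scope": "all-apis"},
    }

# To send it (requires the `requests` package):
#   import requests
#   req = build_token_request("<workspace-host>", "<client-id>", "<client-secret>")
#   resp = requests.post(req["url"], auth=req["auth"], data=req["data"])
#   access_token = resp.json()["access_token"]
```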
DLH.io will only use your credentials to access the workspace, focused on creating only the necessary objects required for the data synchronization process and otherwise making read-only schema or data references; it will not conduct operations at the account level.
To enable the service principal to use clusters or SQL warehouses, you must give the service principal access to them. See Compute permissions or Manage a SQL warehouse.
For a SQL Warehouse, provide the CAN USE permission at a minimum. To understand the security ACL, visit this link, https://docs.databricks.com/en/security/auth/access-control/index.html#sql-warehouses
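The CAN USE grant can also be applied through the Databricks Permissions API rather than the UI. The sketch below only assembles the request; the warehouse ID and service principal application ID are placeholders, and you would send it as an authenticated PATCH to your workspace:

```python
def can_use_grant(warehouse_id: str, service_principal: str) -> dict:
    """Assemble a Permissions API request granting CAN USE on a SQL Warehouse."""
    return {
        "method": "PATCH",
        "endpoint": f"/api/2.0/permissions/warehouses/{warehouse_id}",
        "body": {
            "access_control_list": [
                {
                    # Application ID of the service principal from the prerequisites
                    "service_principal_name": service_principal,
                    "permission_level": "CAN_USE",
                }
            ]
        },
    }
```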
For a Compute / Cluster, add the Can Restart permission at a minimum.
The Can Restart permission is necessary if your Cluster terminates (auto-shutdown, etc.) and you need the user or service principal account to start up the Compute when DLH.io sync bridges initiate the synchronization process. If your Compute is up most of the time and you wish to use a slightly lower security permission, you can use the Can Attach To permission; however, DLH.io will not be able to re/start the Compute if it is down or terminated for any reason.
Failure to provide the correct permission level may result in the following error:
PERMISSION_DENIED: You do not have permission to autostart
or
PERMISSION_DENIED: User does not have USE SCHEMA on Schema
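If you script around sync failures, the two errors above can be mapped to their remediation. A hypothetical helper; the hints simply restate the permission guidance from this page:

```python
def remediation_hint(error_message: str) -> str:
    """Map the known PERMISSION_DENIED messages to the fix described above."""
    if "permission to autostart" in error_message:
        return "Grant Can Restart on the Compute so DLH.io can start it."
    if "USE SCHEMA" in error_message:
        return "Grant USE SCHEMA (and USE CATALOG) on the target catalog/schema."
    return "Check the catalog, schema, and compute permissions listed above."
```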