<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Databricks Archives - AzureOps</title>
	<atom:link href="https://azureops.org/articles/category/azure/databricks/feed/" rel="self" type="application/rss+xml" />
	<link>https://azureops.org/articles/category/azure/databricks/</link>
	<description>Notable things about Cloud, Data and DevOps.</description>
	<lastBuildDate>Sun, 12 Oct 2025 08:16:15 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://i0.wp.com/azureops.org/wp-content/uploads/2021/04/cropped-android-chrome-512x512-1.png?fit=32%2C32&#038;ssl=1</url>
	<title>Databricks Archives - AzureOps</title>
	<link>https://azureops.org/articles/category/azure/databricks/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">190208641</site>	<item>
		<title>Automate Databricks Infrastructure as Code with Terraform</title>
		<link>https://azureops.org/articles/automate-databricks-infrastructure-as-code-with-terraform/</link>
		
		<dc:creator><![CDATA[James Sandy]]></dc:creator>
		<pubDate>Fri, 04 Apr 2025 18:47:31 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Terraform]]></category>
		<category><![CDATA[Databricks cicd]]></category>
		<category><![CDATA[IAAC]]></category>
		<guid isPermaLink="false">https://azureops.org/?p=8414</guid>

					<description><![CDATA[<p>This article describes how to implement Infrastructure as Code for Databricks with Terraform.</p>
<p>The post <a href="https://azureops.org/articles/automate-databricks-infrastructure-as-code-with-terraform/">Automate Databricks Infrastructure as Code with Terraform</a> appeared first on <a href="https://azureops.org">AzureOps</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="">Terraform can be used to provision, manage, and scale Databricks environments; it allows <a href="https://learn.microsoft.com/en-us/devops/deliver/what-is-infrastructure-as-code">infrastructure to be defined as code (IaC)</a>, enabling version control, reproducibility, and automation. Using Terraform to deploy a Databricks workspace streamlines resource provisioning, access management, and workspace configuration, ensuring consistency across environments and simplifying scaling and modifications. This article describes how to automate Databricks infrastructure with Terraform.</p>



<h2 class="wp-block-heading"><a></a>Terraform Environment Setup</h2>



<p class="">Databricks is a cloud-based platform built on Apache Spark, used in machine learning, big data analytics, and data engineering to enable AI model deployment, fast data processing, and collaboration. Its scalable Delta Lake infrastructure supports efficient, secure data management.</p>



<p class="">To manage Databricks infrastructure with Terraform, a few pieces must be installed and configured; the first step is installing Terraform and the Databricks provider. Download Terraform from the <a href="https://developer.hashicorp.com/terraform/downloads">official website</a>, ensure it is available in the system&#8217;s path, and then initialize a Terraform working directory by creating a new directory and running:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; title: ; notranslate">
terraform init
</pre></div>
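
<p class="">Before running init, the working directory needs a provider declaration so Terraform knows what to download. A minimal sketch (the version constraint shown is an assumption):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; title: ; notranslate">
terraform {
 required_providers {
   databricks = {
     source  = &quot;databricks/databricks&quot;
     version = &quot;~&gt; 1.0&quot;
   }
 }
}
</pre></div>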


<p class="">Authentication in Databricks is configured using credentials from the underlying cloud provider. For Azure, a service principal is used to set up authentication:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; highlight: [2]; title: ; notranslate">
provider &quot;databricks&quot; {
 host  = &quot;https://&lt;your-databricks-instance&gt;&quot;
 azure_client_id     = var.azure_client_id
 azure_client_secret = var.azure_client_secret
 azure_tenant_id     = var.azure_tenant_id
}
</pre></div>


<p class="">An access key or a service account will be required for AWS or GCP, respectively.</p>
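
<p class="">Whichever cloud is used, credentials should be passed in as sensitive input variables rather than hard-coded. A minimal sketch (the variable name mirrors the Azure provider configuration shown earlier):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; title: ; notranslate">
variable &quot;azure_client_secret&quot; {
 type      = string
 sensitive = true   # keeps the value out of plan output
}
</pre></div>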



<h2 class="wp-block-heading"><a></a>Databricks Infrastructure with Terraform Setup</h2>



<p class="">Terraform configurations are written in HCL (HashiCorp Configuration Language). The primary configuration file, main.tf, declares a Databricks workspace:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; highlight: [1,2,3,4]; title: ; notranslate">
resource &quot;azurerm_databricks_workspace&quot; &quot;example&quot; {
 name                = &quot;example-workspace&quot;
 resource_group_name = &quot;example-rg&quot;
 location            = &quot;East US&quot;
 sku                 = &quot;standard&quot;
}
</pre></div>


<p class="">Configuring clusters within a workspace requires a few key settings, such as the Spark runtime version, node type, and autoscaling:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; highlight: [1,2]; title: ; notranslate">
resource &quot;databricks_cluster&quot; &quot;example&quot; {
 cluster_name            = &quot;example-cluster&quot;
 spark_version           = &quot;12.0.x-scala2.12&quot;
 node_type_id            = &quot;Standard_D3_v2&quot;
 autotermination_minutes = 20
 autoscale {
   min_workers = 2
   max_workers = 8
 }
}
</pre></div>


<h2 class="wp-block-heading"><a></a>Access Control and Permissions</h2>



<p class="">Terraform can manage user roles and access control lists via Databricks permissions, controlling user and group access:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; highlight: [1,2,5,6,7]; title: ; notranslate">
resource &quot;databricks_group&quot; &quot;data_scientists&quot; {
 display_name = &quot;Data Scientists&quot;
}

resource &quot;databricks_user&quot; &quot;analyst&quot; {
 user_name = &quot;analyst@example.com&quot;
}

resource &quot;databricks_group_member&quot; &quot;analyst_membership&quot; {
 group_id  = databricks_group.data_scientists.id
 member_id = databricks_user.analyst.id
}
</pre></div>


<p class="">Integrating Terraform with a cloud IAM system such as <a href="https://www.microsoft.com/en-us/security/business/identity-access/microsoft-entra-id">Microsoft Entra ID</a> ensures authentication and access policies are in line with enterprise governance best practices.</p>






<h2 class="wp-block-heading"><a></a>Deploying Databricks Jobs with Terraform</h2>



<p class="">Databricks jobs are used to automate batch and streaming workloads; the databricks_job resource defines the job execution parameters:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; highlight: [1,2,9]; title: ; notranslate">
resource &quot;databricks_job&quot; &quot;example_job&quot; {
 name = &quot;Example Job&quot;
 new_cluster {
   spark_version = &quot;12.0.x-scala2.12&quot;
   node_type_id  = &quot;Standard_D3_v2&quot;
   num_workers   = 4
 }
 notebook_task {
   notebook_path = &quot;/Shared/example_notebook&quot;
 }
}
</pre></div>


<p class="">Scheduling jobs and dependencies are managed within Terraform, allowing seamless automation of recurring tasks.</p>
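
<p class="">For illustration, a cron-based schedule can be attached to a job definition as sketched below (the cron expression and timezone are assumptions):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; title: ; notranslate">
resource &quot;databricks_job&quot; &quot;nightly&quot; {
 name = &quot;Nightly Example Job&quot;
 schedule {
   quartz_cron_expression = &quot;0 0 2 * * ?&quot;   # run daily at 02:00
   timezone_id            = &quot;UTC&quot;
 }
}
</pre></div>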



<h2 class="wp-block-heading"><a></a>Versioning with Terraform</h2>



<p class="">Terraform uses a state file to track all deployed resources; to enable collaboration and prevent conflicts, remote state storage must be configured.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; highlight: [3,4,5,6]; title: ; notranslate">
terraform {
 backend &quot;azurerm&quot; {
   resource_group_name  = &quot;terraform-state-rg&quot;
   storage_account_name = &quot;tfstateaccount&quot;
   container_name       = &quot;tfstate&quot;
   key                  = &quot;databricks.tfstate&quot;
 }
}
</pre></div>


<p class="">This ensures infrastructure changes are reviewed with <em>terraform plan</em> before being rolled out with <em>terraform apply</em>, preventing unintended changes.</p>
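
<p class="">In practice, a safe workflow is to save the reviewed plan and then apply exactly that plan; a sketch:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; title: ; notranslate">
terraform plan -out=tfplan   # review the proposed changes
terraform apply tfplan       # apply exactly what was reviewed
</pre></div>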



<h2 class="wp-block-heading"><a></a>Automating Deployment with CI/CD</h2>



<p class="">To keep infrastructure consistent, Terraform should be integrated with CI/CD; a GitHub Actions pipeline can be used for automated deployments:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; title: ; notranslate">
name: Terraform Deployment
on: &#x5B;push]
jobs:
 deploy:
   runs-on: ubuntu-latest
   steps:
     - uses: actions/checkout@v3
     - name: Setup Terraform
       uses: hashicorp/setup-terraform@v1
     - name: Terraform Init
       run: terraform init
     - name: Terraform Apply
       run: terraform apply -auto-approve
</pre></div>


<p class="">This pipeline can also be configured in GitLab CI/CD or Azure DevOps for streamlined deployments.</p>
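
<p class="">As a comparison point, a minimal Azure DevOps pipeline sketch for the same flow (the pool image and step layout are assumptions):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; title: ; notranslate">
trigger: &#x5B; main ]
pool:
 vmImage: ubuntu-latest
steps:
 - script: |
     terraform init
     terraform apply -auto-approve
   displayName: Terraform Deploy
</pre></div>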



<h2 class="wp-block-heading"><a></a>Monitoring and Scaling Databricks Infrastructure</h2>



<p class="">Databricks environments require monitoring for performance and cost efficiency; Terraform can be used to configure monitoring solutions:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; highlight: [1,2]; title: ; notranslate">
resource &quot;databricks_instance_profile&quot; &quot;monitoring&quot; {
 instance_profile_arn = &quot;arn:aws:iam::123456789012:instance-profile/DatabricksMonitor&quot;
}
</pre></div>


<p class="">Autoscaling is enabled on the clusters to optimize resource utilization without over-provisioning:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; title: ; notranslate">
autoscale {
 min_workers = 4
 max_workers = 16
}
</pre></div>


<h2 class="wp-block-heading"><a></a>Errors and Debugging in Terraform Deployments</h2>



<p class="">Some common Terraform errors include authentication failures and state mismatches. Provider authentication issues can be resolved by verifying permissions and credentials. State mismatches often require manual state adjustments:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; highlight: [1]; title: ; notranslate">
terraform state rm databricks_cluster.example
terraform apply
</pre></div>


<p class="">Quota restrictions and API rate limits can be handled by adjusting Terraform configurations to include retries and back-off mechanisms.</p>
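
<p class="">As one hedged example, the Databricks Terraform provider exposes a client-side <em>rate_limit</em> setting that can reduce throttling (the value shown is an assumption):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; title: ; notranslate">
provider &quot;databricks&quot; {
 host       = &quot;https://&lt;your-databricks-instance&gt;&quot;
 rate_limit = 10   # cap API requests per second issued by Terraform
}
</pre></div>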



<h2 class="wp-block-heading"><a></a>Optimizing Databricks Deployments</h2>



<p class="">Additional resources, such as <a href="https://www.databricks.com/product/unity-catalog">Unity Catalog</a> for data governance, can also be managed with Terraform:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; highlight: [2]; title: ; notranslate">
resource &quot;databricks_metastore&quot; &quot;unity&quot; {
 name = &quot;example-metastore&quot;
}
</pre></div>


<p class=""><a href="https://mlflow.org/">MLflow</a> tracking servers can also be set up to manage ML models and notebooks through Git integration, ensuring streamlined workflows.</p>
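
<p class="">For the Git integration piece, a Databricks Repo itself can be declared in Terraform; a hedged sketch (the URL and path are placeholders):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; gutter: false; title: ; notranslate">
resource &quot;databricks_repo&quot; &quot;project&quot; {
 url  = &quot;https://github.com/example-org/example-repo.git&quot;
 path = &quot;/Repos/terraform/example-repo&quot;
}
</pre></div>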



<p class="">This tutorial has shown how to automate Databricks infrastructure with Terraform, covering infrastructure provisioning, access control, job automation, CI/CD integration, and monitoring; further optimizations such as cost management, security best practices, and additional Databricks features can be layered on top.</p>



<p class="has-background" style="background-color:#beefca"><strong>Pro tips:</strong><br>1. Follow this <a href="https://azureops.org/articles/manage-secret-scopes-in-databricks-using-gui/" target="_blank" rel="noreferrer noopener">guide</a> to learn how to manage secret scopes in Databricks.</p>



<h2 class="wp-block-heading">See more</h2>



<iframe width="700" height="394" src="https://www.youtube.com/embed/t2h6xNVFQkc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>



<div class="wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex">
<div class="is-style-fill wp-block-button"><a class="wp-block-button__link has-white-color has-blush-light-purple-gradient-background has-text-color has-background has-link-color wp-element-button" href="https://azureops.org/product/ssis-catalog-migration-wizard-pro/" target="_blank" rel="noreferrer noopener">Download Now</a></div>
</div>



<p class=""></p>
<p>The post <a href="https://azureops.org/articles/automate-databricks-infrastructure-as-code-with-terraform/">Automate Databricks Infrastructure as Code with Terraform</a> appeared first on <a href="https://azureops.org">AzureOps</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">8414</post-id>	</item>
		<item>
		<title>Databricks VACUUM Command</title>
		<link>https://azureops.org/articles/databricks-vacuum-command/</link>
		
		<dc:creator><![CDATA[Kunal Rathi]]></dc:creator>
		<pubDate>Tue, 17 Sep 2024 00:22:00 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<guid isPermaLink="false">https://azureops.org/?p=2860</guid>

					<description><![CDATA[<p>Databricks is a unified big data processing and analytics cloud platform for transforming and processing vast volumes of data. Apache Spark is the building block of Databricks, an in-memory analytics engine for big data and machine learning. In this article, we will see how to use the Databricks VACUUM command to remove unused files from [&#8230;]</p>
<p>The post <a href="https://azureops.org/articles/databricks-vacuum-command/">Databricks VACUUM Command</a> appeared first on <a href="https://azureops.org">AzureOps</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="">Databricks is a unified big data processing and analytics cloud platform for transforming and processing vast volumes of data. Apache Spark is the building block of Databricks, an in-memory analytics engine for big data and machine learning. In this article, we will see how to use the Databricks VACUUM command to remove unused files from the delta table.</p>



<h2 class="wp-block-heading">What is VACUUM in the Delta table?</h2>



<p class="">VACUUM removes files from the table directory that are not managed by Delta, as well as data files that are no longer referenced by the latest state of the table&#8217;s transaction log and are older than a retention threshold.</p>
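
<p class="">In its simplest SQL form it looks like this (the table name is a placeholder; the default retention threshold is 7 days):</p>



<pre class="wp-block-preformatted">VACUUM my_schema.my_table RETAIN 168 HOURS
-- preview which files would be deleted without removing anything:
VACUUM my_schema.my_table DRY RUN</pre>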



<h2 class="wp-block-heading">How to use Databricks VACUUM on Databricks Delta tables</h2>



<p class="">The following PySpark snippet finds the matching databases, then for each table sets shorter retention properties, runs VACUUM, and recomputes table statistics:</p>



<pre class="wp-block-preformatted">database_names_filter = "20_silver_zendesk_eg"
dbs = spark.sql(f"SHOW DATABASES LIKE '{database_names_filter}'").select("databaseName").collect()
dbs = [row.databaseName for row in dbs]
for database_name in dbs:
    print(f"Found database: {database_name}, performing actions on all its tables..")
    tables = spark.sql(f"SHOW TABLES FROM {database_name}").select("tableName").collect()
    tables = [row.tableName for row in tables]
    for table_name in tables:
        print(f"Performing VACUUM on {table_name}")
        # shorten retention windows so VACUUM can reclaim more files
        spark.sql(f"ALTER TABLE {database_name}.{table_name} SET TBLPROPERTIES ('delta.logRetentionDuration'='interval 2 days', 'delta.deletedFileRetentionDuration'='interval 1 days')")
        spark.sql(f"VACUUM {database_name}.{table_name}")
        spark.sql(f"ANALYZE TABLE {database_name}.{table_name} COMPUTE STATISTICS")</pre>



<p class="">If you run&nbsp;<code>VACUUM</code>&nbsp;on a Delta table, you lose the ability to&nbsp;<a href="https://learn.microsoft.com/en-us/azure/databricks/delta/history">time-travel</a>&nbsp;back to a version older than the specified data retention period.</p>
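
<p class="">Before vacuuming, you can inspect the table history to see which versions will become unreachable (the table name is a placeholder):</p>



<pre class="wp-block-preformatted">DESCRIBE HISTORY my_schema.my_table
-- time travel fails once the underlying files have been vacuumed away:
SELECT * FROM my_schema.my_table VERSION AS OF 5</pre>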



<p>The post <a href="https://azureops.org/articles/databricks-vacuum-command/">Databricks VACUUM Command</a> appeared first on <a href="https://azureops.org">AzureOps</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2860</post-id>	</item>
		<item>
		<title>Kafka Streaming vs Spark Streaming</title>
		<link>https://azureops.org/articles/kafka-streaming-vs-spark-streaming/</link>
		
		<dc:creator><![CDATA[Kunal Rathi]]></dc:creator>
		<pubDate>Wed, 15 Feb 2023 18:39:39 +0000</pubDate>
				<category><![CDATA[Confluent]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Apache Kafka]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[kafka vs spark]]></category>
		<guid isPermaLink="false">https://azureops.org/?p=4988</guid>

					<description><![CDATA[<p>Kafka Streams and Spark Streams are potent tools for real-time processing, Here are the key differences Kafka Streaming vs Spark Streaming.</p>
<p>The post <a href="https://azureops.org/articles/kafka-streaming-vs-spark-streaming/">Kafka Streaming vs Spark Streaming</a> appeared first on <a href="https://azureops.org">AzureOps</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Kafka Streams and Spark Streaming are distributed computing frameworks that allow the processing of real-time data streams. In this article, you will see some differences between Kafka Streaming vs. Spark Streaming.</p>



<h2 class="wp-block-heading" id="h-what-is-data-streaming">What is Data Streaming?</h2>



<p>Data streaming is a method in which input is produced continuously and transformations are applied as the data arrives; the output is likewise emitted as a continuous data stream. This is often described as setting data in motion.</p>
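
<p>As a toy illustration in plain Python (not Kafka or Spark), a stream can be modeled as a generator whose records are transformed one at a time as they arrive:</p>



<pre class="wp-block-preformatted">def sensor_stream():
    # simulate an unbounded source that yields one reading at a time
    for reading in [21.5, 22.0, 23.7, 19.8]:
        yield {"temp_c": reading}

def to_fahrenheit(stream):
    # transform each record as it arrives -- the data stays "in motion"
    for record in stream:
        yield {"temp_f": record["temp_c"] * 9 / 5 + 32}

results = [r["temp_f"] for r in to_fahrenheit(sensor_stream())]
print(results)</pre>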



<h2 class="wp-block-heading">What is Kafka Stream?</h2>



<p>Kafka Streams is a library for building streaming applications that transform input Kafka topics into output Kafka topics. Kafka Streams (KStreams) internally uses the Kafka producer and consumer libraries. It is tightly coupled with Kafka, and the API lets you leverage Kafka&#8217;s capabilities, achieving data parallelism, fault tolerance, low latency, and much more.</p>



<h2 class="wp-block-heading">What is Spark Stream?</h2>



<p>Spark Streaming is an extension of the core Spark API that provides scalable, high-throughput, fault-tolerant stream processing of live data streams. It allows real-time data processing from various sources such as Kafka topics, Flume, and Amazon Kinesis. The processed data can be written to file systems, databases, live dashboards, and more.</p>



<p>The sections below describe the differences between the streaming components of Kafka and Spark.</p>



<h2 class="wp-block-heading">Key difference between Kafka streaming and Spark streaming</h2>



<figure class="wp-block-table is-style-stripes"><table class="has-fixed-layout"><thead><tr><th></th><th>Kafka Streaming</th><th>Spark Streaming</th></tr></thead><tbody><tr><td>Technology stack</td><td>Kafka Streams is a Java library built on Apache Kafka, a distributed messaging system for real-time data streams.</td><td>Spark Streaming is a part of the Apache Spark ecosystem, a general-purpose big data processing engine.</td></tr><tr><td>Initial release</td><td>2016</td><td>2013</td></tr><tr><td>Processing model</td><td>Kafka Streams is a stream processing library that processes data records/events one at a time as they arrive in a stream.&nbsp;The processing logic assumes an independent record and some contextual/state information about the record. That limits the type of algorithms/computations you can implement in real-time. </td><td>Spark Streaming uses a micro-batch processing model, which simultaneously processes small batches of data records collected over time. The processing logic assumes you have all the related records available in the batch, allowing you to implement a wide range of algorithms/computations. 
</td></tr><tr><td>Fault tolerance</td><td>Kafka Streams leverages the built-in fault tolerance features of Kafka</td><td>Spark Streaming uses RDD (Resilient Distributed Datasets) to achieve fault tolerance.</td></tr><tr><td>Ease of use</td><td>Kafka Streams is known for its ease of use, as it has a simple and lightweight API designed to be developer-friendly.</td><td>Spark Streaming can be more complex to set up and configure, but it offers more features and tools for data processing and analysis.</td></tr><tr><td>Data sources and destinations</td><td>Can handle data from Kafka topics</td><td>Can handle data from Kafka topics and other sources like HDFS, AWS S3, data lakes, etc.</td></tr><tr><td>Integration</td><td>Kafka Streams is designed to work specifically with Kafka and requires a Kafka cluster to be set up.</td><td>It can run on various platforms, including Hadoop, Kubernetes, and Apache Mesos.</td></tr><tr><td>Managed cloud providers</td><td><a href="https://www.confluent.io/" target="_blank" rel="noreferrer noopener">Confluent</a>, AWS MSK, Azure Event Hub, GCP Pub/Sub, etc.</td><td><a href="https://www.databricks.com/" target="_blank" rel="noreferrer noopener">DataBricks</a>, AWS EMR, Azure HDInsight, GCP Dataproc, etc.</td></tr><tr><td>No-Code Low-Code API</td><td>kSQL</td><td>Spark SQL</td></tr><tr><td>When to go for</td><td>If your streaming application requires low latency processing of data from Kafka topics and you don&#8217;t need to process data from other sources, </td><td>If you need to process data from multiple sources or require a larger ecosystem and latency is not critical for your application. 
</td></tr><tr><td><br>Real-world examples</td><td><strong>Airbnb</strong>: Airbnb uses Kafka Streams to process and analyze real-time data from their website, mobile applications, and other platforms to provide personalized recommendations to their users, optimize their operations, and detect fraudulent activities.<br><strong>Goldman Sachs</strong>: Goldman Sachs uses Kafka Streams to process and analyze real-time financial data from different sources to monitor their trading activities, detect anomalies, and optimize their trading strategies.</td><td><strong>Uber</strong>: Uber uses Spark Streaming to process real-time data from their ride-hailing platform to monitor and improve the quality of their service, detect fraudulent activities, and optimize their operations.<br><strong>Netflix</strong>: Netflix uses Spark Streaming to analyze real-time customer data, monitor their streaming service, and perform real-time personalization to recommend personalized content to users.</td></tr></tbody></table><figcaption class="wp-element-caption">Kafka Streaming vs Spark Streaming</figcaption></figure>



<h2 class="wp-block-heading">Summary</h2>



<p>Kafka Streams and Spark Streaming are potent tools for real-time data processing, but they have different strengths and weaknesses depending on the specific use case and requirements. All the above differences are based on my experiences and research and may not be accurate.</p>



<p>The post <a href="https://azureops.org/articles/kafka-streaming-vs-spark-streaming/">Kafka Streaming vs Spark Streaming</a> appeared first on <a href="https://azureops.org">AzureOps</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4988</post-id>	</item>
		<item>
		<title>Databricks Certified Data Engineer Associate</title>
		<link>https://azureops.org/articles/databricks-certified-data-engineer-associate/</link>
		
		<dc:creator><![CDATA[Kunal Rathi]]></dc:creator>
		<pubDate>Sun, 13 Nov 2022 22:50:47 +0000</pubDate>
				<category><![CDATA[Certification]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Certifications]]></category>
		<category><![CDATA[Databricks certification]]></category>
		<category><![CDATA[Databricks certified data engineer associate exam questions]]></category>
		<guid isPermaLink="false">https://azureops.org/?p=4124</guid>

					<description><![CDATA[<p>In this post, We have documented how to conquer the Databricks Certified Data Engineer Associate certification in this article. Databricks has introduced the Data Engineering Associate exam which consists of 45 multiple-choice questions. The time slot allocated is 1.30 hr. The passing score is 32 which is 70%. High-level topics with their weightage in the exam are:</p>
<p>The post <a href="https://azureops.org/articles/databricks-certified-data-engineer-associate/">Databricks Certified Data Engineer Associate</a> appeared first on <a href="https://azureops.org">AzureOps</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="">Databricks is a unified big data processing and analytics cloud platform that transforms and processes enormous volumes of data. Apache Spark is the building block of Databricks, an in-memory analytics engine for big data and machine learning. This article documents how to conquer the Databricks Certified Data Engineer Associate and the Microsoft <a href="https://www.edureka.co/microsoft-azure-data-engineering-certification-course" target="_blank" rel="noreferrer noopener sponsored nofollow">Azure data engineer certification</a> (DP-203).<br>Databricks has introduced the Data Engineer Associate exam, which consists of 45 multiple-choice questions. The allocated time is 90 minutes, and the passing score is 32 correct answers (about 70%). High-level topics and their weightage in the exam are:</p>



<figure class="wp-block-image size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="723" height="423" data-attachment-id="4153" data-permalink="https://azureops.org/articles/databricks-certified-data-engineer-associate/image-1-5/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image-1.png?fit=723%2C423&amp;ssl=1" data-orig-size="723,423" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-1" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image-1.png?fit=300%2C176&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image-1.png?fit=723%2C423&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image-1.png?resize=723%2C423&#038;ssl=1" alt="Databricks Certified Data Engineer Associate exam topics with weightages" class="wp-image-4153" style="width:550px" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image-1.png?w=723&amp;ssl=1 723w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image-1.png?resize=450%2C263&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image-1.png?resize=600%2C351&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image-1.png?resize=300%2C176&amp;ssl=1 300w" sizes="auto, (max-width: 723px) 100vw, 723px" /></figure>



<h2 class="wp-block-heading">Learning Pathway</h2>



<figure class="wp-block-image size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="846" height="97" data-attachment-id="4131" data-permalink="https://azureops.org/articles/databricks-certified-data-engineer-associate/image-12/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image.png?fit=846%2C97&amp;ssl=1" data-orig-size="846,97" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image.png?fit=300%2C34&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image.png?fit=846%2C97&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image.png?resize=846%2C97&#038;ssl=1" alt="databricks certified data engineer associate certification roadmap." class="wp-image-4131" style="width:635px;height:73px" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image.png?w=846&amp;ssl=1 846w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image.png?resize=450%2C52&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image.png?resize=600%2C69&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image.png?resize=300%2C34&amp;ssl=1 300w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/11/image.png?resize=768%2C88&amp;ssl=1 768w" sizes="auto, (max-width: 846px) 100vw, 846px" /><figcaption class="wp-element-caption">Source Databricks</figcaption></figure>



<p class="has-pale-cyan-blue-background-color has-background"><strong>Before you begin</strong><br>1. You should be familiar with Databricks, data engineering concepts, Python, and SQL. <br>2. Databricks offers three certifications in the data engineering space. Before attempting the Data Engineer Associate certification, it is advisable to learn Databricks Lakehouse Fundamentals and pass the Databricks Lakehouse Fundamentals accreditation.<br>3. Exam and certification details are available on the Databricks website.<br>4. Databricks offers free exam vouchers to those who attend a three-day Data Lake seminar organized by Databricks. Watch for the next event <a href="https://www.databricks.com/events" target="_blank" rel="noreferrer noopener">here</a>. </p>



<h2 class="wp-block-heading">Prepare for the Exam</h2>



<ol class="wp-block-list">
<li class=""><a href="https://partner-academy.databricks.com/learn">Sign in</a> to the Databricks Partner Academy site with your work email ID for self-paced videos or a paid instructor-led course.</li>



<li class="">Choose&nbsp;<a href="https://partner-academy.databricks.com/learn/lp/10/data-engineer-learning-plan" target="_blank" rel="noreferrer noopener">this</a> course for this Databricks Certified Data Engineer Associate certification exam.</li>
</ol>



<h3 class="wp-block-heading">Study these items in detail:</h3>



<ul class="wp-block-list">
<li class="">Databricks Lakehouse Platform <br><a href="https://www.youtube.com/watch?v=wNo-pfWAzQk" target="_blank" rel="noreferrer noopener nofollow">What is Lakehouse</a></li>



<li class="">ELT with Spark SQL and Python<br>Learn to perform Extract Load and Transform on data using Spark SQL and Python in Databricks.<br><a href="https://www.youtube.com/watch?v=Ia6fDlhlKXQ" target="_blank" rel="noreferrer noopener nofollow">Learn ELT with spark and python</a></li>



<li class=""><a href="https://www.databricks.com/session/incremental-processing-on-large-analytical-datasets" target="_blank" rel="noreferrer noopener nofollow">Incremental Data Processing </a><br><a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html" target="_blank" rel="noreferrer noopener nofollow">Structured Streaming</a><br><a href="https://www.youtube.com/watch?v=8a38Fv9cpd8" target="_blank" rel="noreferrer noopener nofollow">Auto Loader</a><br><a href="https://www.databricks.com/discover/pages/getting-started-with-delta-live-tables" target="_blank" rel="noreferrer noopener nofollow">Delta Live Tables</a></li>



<li class=""><a href="https://www.databricks.com/blog/2021/09/08/5-steps-to-implementing-intelligent-data-pipelines-with-delta-live-tables.html" target="_blank" rel="noreferrer noopener nofollow">Production Pipelines</a></li>



<li class=""><a href="https://atlan.com/databricks-governance/" target="_blank" rel="noreferrer noopener nofollow">Data Governance</a></li>
</ul>



<figure class="is-style-default wp-block-image size-large is-resized"><a href="https://marketplace.visualstudio.com/items?itemName=AzureOps.ssiscatalogerpro&amp;ssr=false#overview" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="148" data-attachment-id="4839" data-permalink="https://azureops.org/articles/azure-data-studio-for-sql-developers/scmw-horizontal-ad/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?fit=1326%2C163&amp;ssl=1" data-orig-size="1326,163" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="SCMW-horizontal-ad" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?fit=300%2C37&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?fit=1200%2C148&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=1200%2C148&#038;ssl=1" alt="" class="wp-image-4839" style="object-fit:cover;width:811px;height:99px" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=1200%2C148&amp;ssl=1 1200w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=450%2C55&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=600%2C74&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=300%2C37&amp;ssl=1 300w, 
https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=768%2C94&amp;ssl=1 768w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?w=1326&amp;ssl=1 1326w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></figure>



<h2 class="wp-block-heading">Databricks Certified Data Engineer Associate Practice Exam</h2>



<p class="">To familiarize yourself with the Databricks Data Engineer Associate exam questions, attempt the <a href="https://files.training.databricks.com/assessments/practice-exams/PracticeExam-DataEngineerAssociate.pdf" target="_blank" rel="noreferrer noopener">Practice Exam</a> prepared by Databricks to get a feel for the actual exam&#8217;s content and difficulty level. </p>



<p class="">Once you are done with it and understand your preparation, I recommend practicing tests by <a href="https://www.udemy.com/course/databricks-certified-data-engineer-associate-practice-tests/" target="_blank" rel="noreferrer noopener">Akhil V</a> and <a href="https://www.udemy.com/course/databricks-certified-associate-data-engineer/" target="_blank" rel="noreferrer noopener">Certification Champs</a> on Udemy.</p>



<h2 class="wp-block-heading">Book Your Exam Slot</h2>



<p class="">If you consistently score at least 90% on practice tests, you can book your actual exam slot <a href="https://webassessor.com/databricks" target="_blank" rel="noreferrer noopener">here</a>. </p>



<p class="has-background" style="background-color:#bcefca"><strong>Pro tips:</strong><br>1. The <a href="https://community.cloud.databricks.com/login.html" target="_blank" rel="noreferrer noopener">Community Edition</a> of Databricks doesn’t cover all the topics for this exam. You may need to use paid Databricks with <a href="https://azure.microsoft.com/en-in/products/databricks/" target="_blank" rel="noreferrer noopener">Azure</a> or <a href="https://aws.amazon.com/quickstart/architecture/databricks/" target="_blank" rel="noreferrer noopener">AWS</a>. The smallest possible cluster size should be sufficient for learning.<br>2. If you’re aiming for the Developing Solutions for Microsoft Azure certification, take a look at&nbsp;<a href="https://azureops.org/articles/az-204-exam-reference/" target="_blank" rel="noreferrer noopener">these</a>&nbsp;helpful tips.</p>



<h2 class="wp-block-heading">See more</h2>



<iframe loading="lazy" width="700" height="394" src="https://www.youtube.com/embed/t2h6xNVFQkc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>



<div class="wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex">
<div class="is-style-fill wp-block-button"><a class="wp-block-button__link has-white-color has-blush-light-purple-gradient-background has-text-color has-background has-link-color wp-element-button" href="https://azureops.org/product/ssis-catalog-migration-wizard-pro/" target="_blank" rel="noreferrer noopener">Download Now</a></div>
</div>
<p>The post <a href="https://azureops.org/articles/databricks-certified-data-engineer-associate/">Databricks Certified Data Engineer Associate</a> appeared first on <a href="https://azureops.org">AzureOps</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4124</post-id>	</item>
		<item>
		<title>Databricks Secret Scopes – How to Create, Manage, and Use Securely</title>
		<link>https://azureops.org/articles/manage-secret-scopes-in-databricks-using-gui/</link>
		
		<dc:creator><![CDATA[Pavan Bangad]]></dc:creator>
		<pubDate>Wed, 14 Sep 2022 20:42:54 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Key Vault]]></category>
		<category><![CDATA[Databricks secret scopes]]></category>
		<category><![CDATA[Secret scropes in Databricks]]></category>
		<guid isPermaLink="false">https://azureops.org/?p=3266</guid>

					<description><![CDATA[<p>Databricks platform is used to connect to multiple applications. Databricks requires credentials or secrets to connect to these applications. Databricks or Azure Key Vault can store these secrets securely. Secret scopes are used to manage the secrets which are stored in Azure Key Vault or Databricks.</p>
<p>The post <a href="https://azureops.org/articles/manage-secret-scopes-in-databricks-using-gui/">Databricks Secret Scopes – How to Create, Manage, and Use Securely</a> appeared first on <a href="https://azureops.org">AzureOps</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="">Databricks is a unified big data processing and analytics cloud platform that transforms and processes huge volumes of data. It is built on Apache Spark, an in-memory analytics engine for big data and machine learning, and can connect to various sources for data ingestion. This article describes how to manage secret scopes in Databricks using the GUI.</p>



<p class="has-pale-cyan-blue-background-color has-background"><strong>Pre-requisites</strong>:<br>To follow along, you will need:<br>1. Databricks service in Azure, GCP, or AWS cloud.<br>2. A Databricks cluster.<br>3. Azure subscription with Azure Key Vault service created.</p>



<h2 class="wp-block-heading">What are Secret scopes in Databricks?</h2>



<p class="">When working with various applications, the Databricks platform comes in handy. To establish connections, credentials or secrets are necessary, which can be securely stored in Databricks or Azure Key Vault. Secret scopes are responsible for managing these secrets in either Azure Key Vault or Databricks.</p>



<p class=""><strong>Databricks supports two types of secret scopes:</strong><br>1. <a href="https://docs.microsoft.com/en-us/azure/key-vault/general/basic-concepts" target="_blank" rel="noreferrer noopener">Azure Key Vault</a>-backed scopes: to manage secrets stored in Azure Key Vault.<br>2. Databricks-backed scopes: to manage secrets stored in Databricks.</p>



<h2 class="wp-block-heading">Secret Scopes vs Key Vault-Backed Scopes</h2>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Feature</th><th>Secret Scope</th><th>Key Vault-Backed Scope</th></tr></thead><tbody><tr><td>Storage</td><td>Stored inside Databricks workspace</td><td>Stored in Azure Key Vault</td></tr><tr><td>Security</td><td>Basic workspace-level</td><td>Enterprise-grade (RBAC + auditing)</td></tr><tr><td>Ideal For</td><td>Simpler, internal use</td><td>Production, regulated environments</td></tr><tr><td>Creation</td><td><code>databricks secrets create-scope</code></td><td>Linked via Azure Key Vault URI</td></tr></tbody></table></figure>



<p class="">This article will focus on how to manage Azure Key Vault-backed secret scopes.</p>



<h2 class="wp-block-heading">Create an Azure Key Vault-backed scope</h2>



<p class="">Follow the below steps to create an Azure Key Vault-backed secret scope.</p>



<p class="">1. Open the <code>https://&lt;databricks-instance&gt;#secrets/createScope</code> URL in your browser.</p>



<figure class="wp-block-image size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="864" data-attachment-id="3476" data-permalink="https://azureops.org/articles/manage-secret-scopes-in-databricks-using-gui/1-open-url/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/1.-open-url.jpg?fit=1366%2C984&amp;ssl=1" data-orig-size="1366,984" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="1.-open-url" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/1.-open-url.jpg?fit=300%2C216&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/1.-open-url.jpg?fit=1200%2C864&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/1.-open-url.jpg?resize=1200%2C864&#038;ssl=1" alt="Manage Secret Scopes in Databricks" class="wp-image-3476" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/1.-open-url.jpg?resize=1200%2C864&amp;ssl=1 1200w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/1.-open-url.jpg?resize=450%2C324&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/1.-open-url.jpg?resize=600%2C432&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/1.-open-url.jpg?resize=300%2C216&amp;ssl=1 300w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/1.-open-url.jpg?resize=861%2C620&amp;ssl=1 861w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/1.-open-url.jpg?resize=768%2C553&amp;ssl=1 768w, 
https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/1.-open-url.jpg?w=1366&amp;ssl=1 1366w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></figure>
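<p class="">Since this page is not linked from the workspace UI, it can help to derive the URL programmatically. The helper below is a minimal sketch; the workspace URL used in the example is made up:</p>

```python
def create_scope_url(workspace_url: str) -> str:
    # The hidden "create secret scope" page lives at the
    # #secrets/createScope fragment of the workspace URL.
    return workspace_url.rstrip("/") + "#secrets/createScope"

# Hypothetical Azure Databricks workspace URL
url = create_scope_url("https://adb-1234567890123456.7.azuredatabricks.net")
print(url)  # https://adb-1234567890123456.7.azuredatabricks.net#secrets/createScope
```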



<p class="has-text-align-left">2. Provide the below details:</p>



<p class=""><strong>Scope Name: </strong>&lt;Name of the scope&gt;</p>



<p class=""><strong>Manage Principal:</strong> This option specifies which users can manage the secret scope. You can select either &#8220;All Users&#8221; or &#8220;Creator&#8221;.</p>



<p class=""><strong>DNS Name and Resource ID:</strong> Both properties can be found on the Azure Key Vault service&#8217;s Properties page.</p>



<figure class="wp-block-image size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="310" data-attachment-id="3477" data-permalink="https://azureops.org/articles/manage-secret-scopes-in-databricks-using-gui/key-vault-properties/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/key-vault-properties.jpg?fit=1540%2C398&amp;ssl=1" data-orig-size="1540,398" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="key-vault-properties" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/key-vault-properties.jpg?fit=300%2C78&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/key-vault-properties.jpg?fit=1200%2C310&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/key-vault-properties.jpg?resize=1200%2C310&#038;ssl=1" alt="key vault service properties." 
class="wp-image-3477" style="width:854px;height:221px" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/key-vault-properties.jpg?resize=1200%2C310&amp;ssl=1 1200w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/key-vault-properties.jpg?resize=450%2C116&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/key-vault-properties.jpg?resize=600%2C155&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/key-vault-properties.jpg?resize=300%2C78&amp;ssl=1 300w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/key-vault-properties.jpg?resize=768%2C198&amp;ssl=1 768w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/key-vault-properties.jpg?resize=1536%2C397&amp;ssl=1 1536w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/09/key-vault-properties.jpg?w=1540&amp;ssl=1 1540w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></figure>
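<p class="">For a quick cross-check, the DNS name can be derived from the vault name embedded in the resource ID, since Azure Key Vault DNS names follow the <code>https://&lt;vault-name&gt;.vault.azure.net/</code> pattern. A minimal sketch, using a hypothetical resource ID:</p>

```python
def vault_dns_from_resource_id(resource_id: str) -> str:
    # Resource ID format:
    # /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<vault-name>
    vault_name = resource_id.rstrip("/").split("/")[-1]
    return f"https://{vault_name}.vault.azure.net/"

rid = ("/subscriptions/00000000-0000-0000-0000-000000000000"
       "/resourceGroups/my-rg/providers/Microsoft.KeyVault/vaults/my-kv")
print(vault_dns_from_resource_id(rid))  # https://my-kv.vault.azure.net/
```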



<p class="">3. Click Create. This creates the secret scope.</p>



<h2 class="wp-block-heading">Access a secret from the Azure Key Vault in Databricks</h2>



<p class="">We can access a secret in Databricks using the command below.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; gutter: false; title: ; notranslate">
password = dbutils.secrets.get(scope = &quot;&lt;name_of_scope&gt;&quot;, key = &quot;&lt;name_of_secret&gt;&quot;)
</pre></div>
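<p class="">A common follow-on is to splice the retrieved secret into a connection string. The sketch below is illustrative only: the server, database, and user names are hypothetical, and in a notebook the <code>password</code> value would come from <code>dbutils.secrets.get</code> as shown above.</p>

```python
def jdbc_url(server: str, database: str, user: str, password: str) -> str:
    # Builds an Azure SQL JDBC URL; the password should come from a
    # secret scope rather than being hard-coded in the notebook.
    return (f"jdbc:sqlserver://{server}.database.windows.net:1433;"
            f"database={database};user={user};password={password}")

# In Databricks: password = dbutils.secrets.get(scope="...", key="...")
print(jdbc_url("myserver", "mydb", "admin_user", "<from-secret-scope>"))
```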


<figure class="is-style-default wp-block-image size-large is-resized"><a href="https://marketplace.visualstudio.com/items?itemName=AzureOps.ssiscatalogerpro&amp;ssr=false#overview" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="148" data-attachment-id="4839" data-permalink="https://azureops.org/articles/azure-data-studio-for-sql-developers/scmw-horizontal-ad/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?fit=1326%2C163&amp;ssl=1" data-orig-size="1326,163" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="SCMW-horizontal-ad" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?fit=300%2C37&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?fit=1200%2C148&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=1200%2C148&#038;ssl=1" alt="" class="wp-image-4839" style="object-fit:cover;width:811px;height:99px" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=1200%2C148&amp;ssl=1 1200w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=450%2C55&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=600%2C74&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=300%2C37&amp;ssl=1 300w, 
https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=768%2C94&amp;ssl=1 768w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?w=1326&amp;ssl=1 1326w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></figure>



<h2 class="wp-block-heading">Delete a secret scope from Databricks</h2>



<p class="">Unfortunately, it is not possible to delete a secret scope from the GUI. Instead, use the Databricks CLI or the Databricks REST API.</p>
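<p class="">The REST route can be exercised with nothing more than the Python standard library, via the <code>POST /api/2.0/secrets/scopes/delete</code> endpoint. A minimal sketch: the workspace hostname is a made-up example, and the request is built but not actually sent.</p>

```python
import json
import os
import urllib.request

def delete_scope_request(instance: str, scope: str, token: str) -> urllib.request.Request:
    # Databricks REST API: POST /api/2.0/secrets/scopes/delete
    url = f"https://{instance}/api/2.0/secrets/scopes/delete"
    body = json.dumps({"scope": scope}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

# Build the request; supply a real personal access token to actually delete.
req = delete_scope_request(
    "adb-1234567890123456.7.azuredatabricks.net",  # hypothetical workspace
    "my-scope",
    os.environ.get("DATABRICKS_TOKEN", ""),
)
# urllib.request.urlopen(req)  # uncomment to send the request
```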



<h2 class="wp-block-heading">FAQ</h2>


<div class="wp-block-uagb-faq uagb-faq__outer-wrap uagb-block-5fa68f8b uagb-faq-icon-row uagb-faq-layout-accordion uagb-faq-expand-first-true uagb-faq-inactive-other-true uagb-faq__wrap uagb-buttons-layout-wrap uagb-faq-equal-height     " data-faqtoggle="true" role="tablist"><div class="wp-block-uagb-faq-child uagb-faq-child__outer-wrap uagb-faq-item uagb-block-c9ebda65 " role="tab" tabindex="0"><div class="uagb-faq-questions-button uagb-faq-questions">			<span class="uagb-icon uagb-faq-icon-wrap">
								<svg xmlns="https://www.w3.org/2000/svg" viewBox= "0 0 448 512"><path d="M432 256c0 17.69-14.33 32.01-32 32.01H256v144c0 17.69-14.33 31.99-32 31.99s-32-14.3-32-31.99v-144H48c-17.67 0-32-14.32-32-32.01s14.33-31.99 32-31.99H192v-144c0-17.69 14.33-32.01 32-32.01s32 14.32 32 32.01v144h144C417.7 224 432 238.3 432 256z"></path></svg>
							</span>
						<span class="uagb-icon-active uagb-faq-icon-wrap">
								<svg xmlns="https://www.w3.org/2000/svg" viewBox= "0 0 448 512"><path d="M400 288h-352c-17.69 0-32-14.32-32-32.01s14.31-31.99 32-31.99h352c17.69 0 32 14.3 32 31.99S417.7 288 400 288z"></path></svg>
							</span>
			<span class="uagb-question">1. How to view secret scope values in Databricks?</span></div><div class="uagb-faq-content"><p>You cannot view actual secret values for security reasons. However, you can list the scope name and keys using: <code>databricks secrets list --scope my-scope</code></p></div></div><div class="wp-block-uagb-faq-child uagb-faq-child__outer-wrap uagb-faq-item uagb-block-e2e1e957 " role="tab" tabindex="0"><div class="uagb-faq-questions-button uagb-faq-questions">			<span class="uagb-icon uagb-faq-icon-wrap">
								<svg xmlns="https://www.w3.org/2000/svg" viewBox= "0 0 448 512"><path d="M432 256c0 17.69-14.33 32.01-32 32.01H256v144c0 17.69-14.33 31.99-32 31.99s-32-14.3-32-31.99v-144H48c-17.67 0-32-14.32-32-32.01s14.33-31.99 32-31.99H192v-144c0-17.69 14.33-32.01 32-32.01s32 14.32 32 32.01v144h144C417.7 224 432 238.3 432 256z"></path></svg>
							</span>
						<span class="uagb-icon-active uagb-faq-icon-wrap">
								<svg xmlns="https://www.w3.org/2000/svg" viewBox= "0 0 448 512"><path d="M400 288h-352c-17.69 0-32-14.32-32-32.01s14.31-31.99 32-31.99h352c17.69 0 32 14.3 32 31.99S417.7 288 400 288z"></path></svg>
							</span>
			<span class="uagb-question">Can I delete a Databricks secret scope?</span></div><div class="uagb-faq-content"><p>Yes, using: <code>databricks secrets delete-scope --scope my-scope</code></p></div></div></div>


<p class="has-background" style="background-color:#bcefca"><strong>Pro tips:</strong><br>1. Databricks provides a free Community Edition where you can learn and explore Databricks. You can sign up on the Databricks website.<br>2. By managing secret scopes in Databricks, you can keep your sensitive data secure while allowing authorized users and applications to access it when needed.<br>3. If you&#8217;re aiming for the Databricks Certified Data Engineer Associate certification, take a look at these helpful tips.<br>4. Learn how to mount and unmount Data Lake Gen2 storage in Databricks.<br>5. <a href="https://azureops.org/articles/automate-databricks-infrastructure-as-code-with-terraform/" target="_blank" rel="noreferrer noopener">Learn</a> how to automate Databricks infrastructure as code with Terraform.</p>



<iframe loading="lazy" width="700" height="394" src="https://www.youtube.com/embed/t2h6xNVFQkc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>



<div class="wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex">
<div class="is-style-fill wp-block-button"><a class="wp-block-button__link has-white-color has-blush-light-purple-gradient-background has-text-color has-background has-link-color wp-element-button" href="https://azureops.org/product/ssis-catalog-migration-wizard-pro/" target="_blank" rel="noreferrer noopener">Download Now</a></div>
</div>
<p>The post <a href="https://azureops.org/articles/manage-secret-scopes-in-databricks-using-gui/">Databricks Secret Scopes – How to Create, Manage, and Use Securely</a> appeared first on <a href="https://azureops.org">AzureOps</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3266</post-id>	</item>
		<item>
		<title>Mount and Unmount Data Lake in Databricks</title>
		<link>https://azureops.org/articles/mount-and-unmount-data-lake-in-databricks/</link>
		
		<dc:creator><![CDATA[Pavan Bangad]]></dc:creator>
		<pubDate>Wed, 17 Aug 2022 08:00:00 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Data Lake]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[databricks unmount]]></category>
		<category><![CDATA[databricks unmount storage account]]></category>
		<category><![CDATA[dbutils.fs.mount]]></category>
		<category><![CDATA[dbutils.fs.unmount]]></category>
		<category><![CDATA[mount unmount in databricks]]></category>
		<guid isPermaLink="false">https://azureops.org/?p=3015</guid>

					<description><![CDATA[<p>Mounting object storage to the Databricks file system allows easy access to object storage as if it were on the local file system. In this article, we will see how to mount and unmount a data lake in Databricks.</p>
<p>The post <a href="https://azureops.org/articles/mount-and-unmount-data-lake-in-databricks/">Mount and Unmount Data Lake in Databricks</a> appeared first on <a href="https://azureops.org">AzureOps</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Databricks is a unified big data processing and analytics cloud platform that transforms and processes huge volumes of data. It is built on Apache Spark, an in-memory analytics engine for big data and machine learning, and can connect to various sources for data ingestion. This article will show how to mount and unmount a data lake in Databricks.</p>



<p class="has-pale-cyan-blue-background-color has-background"><strong>Pre-requisites</strong>:<br>To mount a location, you would need the following:<br>1. Databricks service in Azure, GCP, or AWS cloud.<br>2. A Databricks cluster.<br>3. A basic understanding of Databricks and how to create notebooks.</p>



<h3 class="wp-block-heading">What is Mounting in Databricks?</h3>



<p>Mounting object storage to DBFS allows easy access to object storage as if it were on the local file system. Once a location, e.g., Azure Blob Storage or an Amazon S3 bucket, is mounted, we can use the mount point to access the external storage.</p>



<p>Generally, we use the <code>dbutils.fs.mount()</code> command to mount a location in Databricks.</p>



<h3 class="wp-block-heading">How to mount a data lake in Databricks?</h3>



<p>Let us now see how to mount Azure Data Lake Gen2 storage in Databricks.</p>



<p>First things first, let&#8217;s create a blob storage account and a container. The storage account should look like the image below.</p>



<figure class="wp-block-image size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="332" data-attachment-id="3033" data-permalink="https://azureops.org/articles/mount-and-unmount-data-lake-in-databricks/2-blob-storage/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/2.blob-storage.jpg?fit=1896%2C524&amp;ssl=1" data-orig-size="1896,524" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="2.blob-storage" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/2.blob-storage.jpg?fit=300%2C83&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/2.blob-storage.jpg?fit=1200%2C332&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/2.blob-storage.jpg?resize=1200%2C332&#038;ssl=1" alt="" class="wp-image-3033" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/2.blob-storage.jpg?resize=1200%2C332&amp;ssl=1 1200w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/2.blob-storage.jpg?resize=450%2C124&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/2.blob-storage.jpg?resize=600%2C166&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/2.blob-storage.jpg?resize=300%2C83&amp;ssl=1 300w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/2.blob-storage.jpg?resize=768%2C212&amp;ssl=1 768w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/2.blob-storage.jpg?resize=1536%2C425&amp;ssl=1 1536w, 
https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/2.blob-storage.jpg?w=1896&amp;ssl=1 1896w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></figure>



<p>The new container should look like the image below.</p>



<figure class="wp-block-image size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="350" data-attachment-id="3034" data-permalink="https://azureops.org/articles/mount-and-unmount-data-lake-in-databricks/7-container_files/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-container_files.jpg?fit=1897%2C554&amp;ssl=1" data-orig-size="1897,554" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="7.-container_files" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-container_files.jpg?fit=300%2C88&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-container_files.jpg?fit=1200%2C350&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-container_files.jpg?resize=1200%2C350&#038;ssl=1" alt="" class="wp-image-3034" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-container_files.jpg?resize=1200%2C350&amp;ssl=1 1200w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-container_files.jpg?resize=450%2C131&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-container_files.jpg?resize=600%2C175&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-container_files.jpg?resize=300%2C88&amp;ssl=1 300w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-container_files.jpg?resize=768%2C224&amp;ssl=1 768w, 
https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-container_files.jpg?resize=1536%2C449&amp;ssl=1 1536w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-container_files.jpg?w=1897&amp;ssl=1 1897w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></figure>



<p>To mount ADLS Gen2 storage, we need the following details to connect to the location.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
ContainerName = &quot;yourcontainerName&quot;
azure_blobstorage_name = &quot;blobstoragename&quot;
mountpointname = &quot;/mnt/azureops&quot;
secret_key =&quot;xxxxxxxxxxx&quot;
</pre></div>


<p class="has-ast-global-color-3-color has-text-color">Once we have this information, we can use the below code snippet to connect the data lake to Databricks.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
dbutils.fs.mount(
    source = f&quot;wasbs://{ContainerName}@{azure_blobstorage_name}.blob.core.windows.net&quot;,
    mount_point = mountpointname,
    extra_configs = {&quot;fs.azure.account.key.&quot; + azure_blobstorage_name + &quot;.blob.core.windows.net&quot;: secret_key}
)
</pre></div>


<figure class="wp-block-image size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="313" data-attachment-id="3029" data-permalink="https://azureops.org/articles/mount-and-unmount-data-lake-in-databricks/4-mount_storage_account/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/4.-mount_storage_account.jpg?fit=1810%2C472&amp;ssl=1" data-orig-size="1810,472" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="4.-mount_storage_account" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/4.-mount_storage_account.jpg?fit=300%2C78&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/4.-mount_storage_account.jpg?fit=1200%2C313&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/4.-mount_storage_account.jpg?resize=1200%2C313&#038;ssl=1" alt="Mount and unmount data lake in Databricks" class="wp-image-3029" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/4.-mount_storage_account.jpg?resize=1200%2C313&amp;ssl=1 1200w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/4.-mount_storage_account.jpg?resize=450%2C117&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/4.-mount_storage_account.jpg?resize=600%2C156&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/4.-mount_storage_account.jpg?resize=300%2C78&amp;ssl=1 300w, 
https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/4.-mount_storage_account.jpg?resize=768%2C200&amp;ssl=1 768w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/4.-mount_storage_account.jpg?resize=1536%2C401&amp;ssl=1 1536w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/4.-mount_storage_account.jpg?w=1810&amp;ssl=1 1810w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></figure>



<h3 class="wp-block-heading">How to check all the mount points in Databricks?</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
dbutils.fs.mounts()
</pre></div>


<figure class="wp-block-image size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="205" data-attachment-id="3028" data-permalink="https://azureops.org/articles/mount-and-unmount-data-lake-in-databricks/1-mounts_location/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/1.mounts_location.jpg?fit=1837%2C314&amp;ssl=1" data-orig-size="1837,314" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="1.mounts_location" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/1.mounts_location.jpg?fit=300%2C51&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/1.mounts_location.jpg?fit=1200%2C205&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/1.mounts_location.jpg?resize=1200%2C205&#038;ssl=1" alt="Check mount points in Databricks" class="wp-image-3028" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/1.mounts_location.jpg?resize=1200%2C205&amp;ssl=1 1200w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/1.mounts_location.jpg?resize=450%2C77&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/1.mounts_location.jpg?resize=600%2C103&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/1.mounts_location.jpg?resize=300%2C51&amp;ssl=1 300w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/1.mounts_location.jpg?resize=768%2C131&amp;ssl=1 768w, 
https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/1.mounts_location.jpg?resize=1536%2C263&amp;ssl=1 1536w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/1.mounts_location.jpg?w=1837&amp;ssl=1 1837w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></figure>



<h3 class="wp-block-heading">How to unmount a location?</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
dbutils.fs.unmount(mount_point)
</pre></div>


<figure class="wp-block-image size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="145" data-attachment-id="3030" data-permalink="https://azureops.org/articles/mount-and-unmount-data-lake-in-databricks/6-unmount/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/6.-unmount.jpg?fit=1802%2C217&amp;ssl=1" data-orig-size="1802,217" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="6.-unmount" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/6.-unmount.jpg?fit=300%2C36&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/6.-unmount.jpg?fit=1200%2C145&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/6.-unmount.jpg?resize=1200%2C145&#038;ssl=1" alt="Unmount data lake in Databricks" class="wp-image-3030" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/6.-unmount.jpg?resize=1200%2C145&amp;ssl=1 1200w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/6.-unmount.jpg?resize=450%2C54&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/6.-unmount.jpg?resize=600%2C72&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/6.-unmount.jpg?resize=300%2C36&amp;ssl=1 300w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/6.-unmount.jpg?resize=768%2C92&amp;ssl=1 768w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/6.-unmount.jpg?resize=1536%2C185&amp;ssl=1 1536w, 
https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/6.-unmount.jpg?w=1802&amp;ssl=1 1802w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></figure>



<h3 class="wp-block-heading">Let&#8217;s put all the above commands into action.</h3>



<p>The objective is to add a mount point if it does not exist.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
if all(mount.mountPoint != mountpointname for mount in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source = f&quot;wasbs://{ContainerName}@{azure_blobstorage_name}.blob.core.windows.net&quot;,
        mount_point = mountpointname,
        extra_configs = {&quot;fs.azure.account.key.&quot; + azure_blobstorage_name + &quot;.blob.core.windows.net&quot;: secret_key}
    )
</pre></div>
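<p>The existence check itself can be verified off-cluster by simulating the list that <code>dbutils.fs.mounts()</code> returns. A minimal sketch (the helper <code>needs_mount</code> is hypothetical, not part of dbutils):</p>

```python
def needs_mount(existing_mount_points, mount_point):
    """True when mount_point is not already mounted, given the mountPoint
    values from dbutils.fs.mounts() (simulated here as plain strings)."""
    return all(existing != mount_point for existing in existing_mount_points)

# Simulated [m.mountPoint for m in dbutils.fs.mounts()]:
existing = ["/databricks-datasets", "/mnt/other"]
print(needs_mount(existing, "/mnt/azureops"))  # True: safe to mount
print(needs_mount(existing, "/mnt/other"))     # False: already mounted
```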


<figure class="wp-block-image size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="3037" data-permalink="https://azureops.org/articles/mount-and-unmount-data-lake-in-databricks/7-usecase/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-usecase.jpg?fit=1819%2C339&amp;ssl=1" data-orig-size="1819,339" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="7.-usecase" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-usecase.jpg?fit=300%2C56&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-usecase.jpg?fit=1200%2C224&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-usecase.jpg?resize=1200%2C224&#038;ssl=1" alt="" class="wp-image-3037" width="1200" height="224" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-usecase.jpg?resize=1200%2C224&amp;ssl=1 1200w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-usecase.jpg?resize=450%2C84&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-usecase.jpg?resize=600%2C112&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-usecase.jpg?resize=300%2C56&amp;ssl=1 300w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-usecase.jpg?resize=768%2C143&amp;ssl=1 768w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-usecase.jpg?resize=1536%2C286&amp;ssl=1 1536w, 
https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/7.-usecase.jpg?w=1819&amp;ssl=1 1819w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></figure>



<figure class="is-style-default wp-block-image size-large is-resized"><a href="https://marketplace.visualstudio.com/items?itemName=AzureOps.ssiscatalogerpro&amp;ssr=false#overview" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="148" data-attachment-id="4839" data-permalink="https://azureops.org/articles/azure-data-studio-for-sql-developers/scmw-horizontal-ad/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?fit=1326%2C163&amp;ssl=1" data-orig-size="1326,163" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="SCMW-horizontal-ad" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?fit=300%2C37&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?fit=1200%2C148&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=1200%2C148&#038;ssl=1" alt="" class="wp-image-4839" style="object-fit:cover;width:811px;height:99px" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=1200%2C148&amp;ssl=1 1200w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=450%2C55&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=600%2C74&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=300%2C37&amp;ssl=1 300w, 
https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=768%2C94&amp;ssl=1 768w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?w=1326&amp;ssl=1 1326w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></figure>



<p class="has-background" style="background-color:#bcefca"><strong>Pro tips:</strong><br>1. Instead of using a storage account key, we can also mount a location using a SAS token or a service principal.<br>2. Databricks provides a free community edition where you can learn and explore Databricks. You can sign up <a href="https://community.cloud.databricks.com/login.html" target="_blank" rel="noreferrer noopener">here</a>.<br>3. If you’re aiming to obtain the Databricks Certified Data Engineer Associate certification, take a look at&nbsp;<a href="https://azureops.org/articles/databricks-certified-data-engineer-associate/" target="_blank" rel="noreferrer noopener">these</a>&nbsp;helpful tips.</p>



<p>Notebook Reference</p>



<div class="wp-block-file"><a id="wp-block-file--media-74f16533-875a-4100-a0b9-bb8d329e82f0" href="https://azureops.org/wp-content/uploads/2022/08/mount_unmount.zip">mount_unmount</a><a href="https://azureops.org/wp-content/uploads/2022/08/mount_unmount.zip" class="wp-block-file__button wp-element-button" download aria-describedby="wp-block-file--media-74f16533-875a-4100-a0b9-bb8d329e82f0">Download</a></div>



<h2 class="wp-block-heading">See more</h2>



<iframe loading="lazy" width="700" height="394" src="https://www.youtube.com/embed/t2h6xNVFQkc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>



<div class="wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex">
<div class="is-style-fill wp-block-button"><a class="wp-block-button__link has-white-color has-blush-light-purple-gradient-background has-text-color has-background has-link-color wp-element-button" href="https://azureops.org/product/ssis-catalog-migration-wizard-pro/" target="_blank" rel="noreferrer noopener">Download Now</a></div>
</div>
<p>The post <a href="https://azureops.org/articles/mount-and-unmount-data-lake-in-databricks/">Mount and Unmount Data Lake in Databricks</a> appeared first on <a href="https://azureops.org">AzureOps</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3015</post-id>	</item>
		<item>
		<title>Call a notebook from another notebook in Databricks</title>
		<link>https://azureops.org/articles/call-a-notebook-from-another-notebook-in-databricks/</link>
		
		<dc:creator><![CDATA[Pavan Bangad]]></dc:creator>
		<pubDate>Wed, 03 Aug 2022 12:43:55 +0000</pubDate>
				<category><![CDATA[Databricks]]></category>
		<category><![CDATA[%run databricks]]></category>
		<category><![CDATA[call databricks notebooks]]></category>
		<category><![CDATA[dbutils notebook run]]></category>
		<category><![CDATA[dbutils.notebook.run]]></category>
		<category><![CDATA[dbutils.notebook.run databricks]]></category>
		<guid isPermaLink="false">https://azureops.org/?p=2624</guid>

					<description><![CDATA[<p>In this article, we will see how to call a notebook from another notebook in Databricks and how to manage the execution context of a notebook.</p>
<p>The post <a href="https://azureops.org/articles/call-a-notebook-from-another-notebook-in-databricks/">Call a notebook from another notebook in Databricks</a> appeared first on <a href="https://azureops.org">AzureOps</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="">Databricks is a unified big data processing and analytics cloud platform that transforms and processes enormous volumes of data. It is built on Apache Spark, an in-memory analytics engine for big data and machine learning. In this article, we will see how to call a notebook from another notebook in Databricks and how to manage the execution context of a notebook.</p>



<h2 class="wp-block-heading">What is Databricks notebook and execution context?</h2>



<p class="">Notebooks in Databricks are used to write Spark code to process and transform data. Notebooks support the Python, Scala, SQL, and R languages.</p>



<p class="">Whenever we execute a notebook in Databricks, it attaches a cluster (computation resource) to it and creates an execution context.</p>



<p class="has-pale-cyan-blue-background-color has-background"><strong>Pre-requisites</strong>:<br>If you want to run a Databricks notebook inside another notebook, you will need the following:<br>1. Databricks service in Azure, GCP, or AWS cloud.<br>2. A Databricks cluster.<br>3. A basic understanding of Databricks and how to create notebooks.</p>



<h2 class="wp-block-heading">Methods to call a notebook from another notebook in Databricks</h2>



<p class="">There are two methods to run a Databricks notebook inside another Databricks notebook.</p>



<h3 class="wp-block-heading">1. Using the %run command</h3>



<p class="">The %run command invokes the notebook in the same execution context, meaning any variable or function declared in the called notebook becomes available in the calling notebook.</p>



<p class="">The sample command would look like the one below.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
%run &#x5B;notebook path] $parameter1=&quot;value1&quot; $parameterN=&quot;valueN&quot;
</pre></div>


<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="650" height="437" data-attachment-id="2809" data-permalink="https://azureops.org/articles/call-a-notebook-from-another-notebook-in-databricks/method1-v2/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method1-v2.gif?fit=650%2C437&amp;ssl=1" data-orig-size="650,437" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="method1-v2" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method1-v2.gif?fit=300%2C202&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method1-v2.gif?fit=650%2C437&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method1-v2.gif?resize=650%2C437&#038;ssl=1" alt="Call a notebook from another notebook in Databricks using %run method" class="wp-image-2809" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method1-v2.gif?w=650&amp;ssl=1 650w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method1-v2.gif?resize=450%2C303&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method1-v2.gif?resize=600%2C403&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method1-v2.gif?resize=300%2C202&amp;ssl=1 300w" sizes="auto, (max-width: 650px) 100vw, 650px" /><figcaption class="wp-element-caption">Example &#8211; Use the %run function to call a notebook inside another notebook.</figcaption></figure>



<p class="">This method is suitable when one notebook defines constant variables or a centralized shared function library that you want to refer to in the calling notebook.</p>



<p class="">What if we need to execute the child notebook in a different notebook context? The following method describes how to achieve this.</p>



<figure class="is-style-default wp-block-image size-large is-resized"><a href="https://marketplace.visualstudio.com/items?itemName=AzureOps.ssiscatalogerpro&amp;ssr=false#overview" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="148" data-attachment-id="4839" data-permalink="https://azureops.org/articles/azure-data-studio-for-sql-developers/scmw-horizontal-ad/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?fit=1326%2C163&amp;ssl=1" data-orig-size="1326,163" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="SCMW-horizontal-ad" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?fit=300%2C37&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?fit=1200%2C148&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=1200%2C148&#038;ssl=1" alt="" class="wp-image-4839" style="object-fit:cover;width:811px;height:99px" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=1200%2C148&amp;ssl=1 1200w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=450%2C55&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=600%2C74&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=300%2C37&amp;ssl=1 300w, 
https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?resize=768%2C94&amp;ssl=1 768w, https://i0.wp.com/azureops.org/wp-content/uploads/2023/01/SCMW-horizontal-ad.png?w=1326&amp;ssl=1 1326w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></figure>



<h3 class="wp-block-heading">2. Using the dbutils.notebook.run() function</h3>



<p class="">This function will run the notebook in a new notebook context.</p>



<p class="">The syntax of the dbutils.notebook.run() function is:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
dbutils.notebook.run(notebook_path, timeout_in_seconds, parameters)
</pre></div>


<p class="">Here, </p>



<p class=""><strong>notebook_path</strong> -&gt; path of the target notebook.<br><strong>timeout_in_seconds</strong> -&gt; the run throws an exception if the notebook does not complete within the specified time.<br><strong>parameters</strong> -&gt; used to send parameters to the child notebook, as a dictionary of string keys and values,<br> e.g. {&#8216;parameter1&#8217;: &#8216;value1&#8217;, &#8216;parameter2&#8217;: &#8216;value2&#8217;}</p>
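<p>Because dbutils.notebook.run() only accepts string keys and values, a small helper (hypothetical, ours rather than a Databricks API) can coerce a parameter dictionary before the call. A sketch:</p>

```python
def to_notebook_params(params):
    """Coerce every key and value to str, since dbutils.notebook.run()
    rejects non-string parameter values."""
    return {str(key): str(value) for key, value in params.items()}

params = to_notebook_params({"retries": 3, "env": "dev"})
# Then, on a cluster: dbutils.notebook.run("/Shared/child", 600, params)
```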






<figure class="wp-block-image size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="600" height="374" data-attachment-id="2810" data-permalink="https://azureops.org/articles/call-a-notebook-from-another-notebook-in-databricks/method2/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method2.gif?fit=600%2C374&amp;ssl=1" data-orig-size="600,374" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="method2" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method2.gif?fit=300%2C187&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method2.gif?fit=600%2C374&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method2.gif?resize=600%2C374&#038;ssl=1" alt="Call a notebook from another notebook in Databricks using the dbutils.notebook function " class="wp-image-2810" style="width:600px;height:374px" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method2.gif?w=600&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method2.gif?resize=450%2C281&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method2.gif?resize=500%2C312&amp;ssl=1 500w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/method2.gif?resize=300%2C187&amp;ssl=1 300w" sizes="auto, (max-width: 600px) 100vw, 600px" /><figcaption class="wp-element-caption">Example &#8211; Use dbutils.notebook.run() to call a notebook inside another notebook.</figcaption></figure>



<p class="">We can call any number of notebooks by invoking this function for each one in the parent notebook.</p>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="828" height="199" data-attachment-id="2812" data-permalink="https://azureops.org/articles/call-a-notebook-from-another-notebook-in-databricks/mutiple-notebook-1/" data-orig-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/mutiple-notebook-1.jpg?fit=828%2C199&amp;ssl=1" data-orig-size="828,199" data-comments-opened="0" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="mutiple-notebook-1" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/mutiple-notebook-1.jpg?fit=300%2C72&amp;ssl=1" data-large-file="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/mutiple-notebook-1.jpg?fit=828%2C199&amp;ssl=1" src="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/mutiple-notebook-1.jpg?resize=828%2C199&#038;ssl=1" alt="" class="wp-image-2812" srcset="https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/mutiple-notebook-1.jpg?w=828&amp;ssl=1 828w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/mutiple-notebook-1.jpg?resize=450%2C108&amp;ssl=1 450w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/mutiple-notebook-1.jpg?resize=600%2C144&amp;ssl=1 600w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/mutiple-notebook-1.jpg?resize=300%2C72&amp;ssl=1 300w, https://i0.wp.com/azureops.org/wp-content/uploads/2022/08/mutiple-notebook-1.jpg?resize=768%2C185&amp;ssl=1 768w" sizes="auto, (max-width: 828px) 100vw, 828px" /></figure>



<p class="">This will run all the notebooks sequentially. </p>



<h2 class="wp-block-heading">Run Databricks notebooks in parallel</h2>



<p class="">You can use the Python concurrent.futures library to run multiple Databricks notebooks in parallel. Its ThreadPoolExecutor class creates multiple threads, each of which runs a notebook.</p>



<p class="">Import the library as follows:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
from concurrent.futures import ThreadPoolExecutor
</pre></div>


<p class="">You can read more about ThreadPoolExecutor <a href="https://docs.python.org/3/library/concurrent.futures.html" target="_blank" rel="noreferrer noopener">here</a>. And <a href="https://www.codesexplorer.com/2020/03/run-databricks-notebooks-in-parallel-python.html" target="_blank" rel="noreferrer noopener">here</a> is sample code that Codes Explorer wrote for running notebooks in parallel.</p>
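<p>As a sketch of the threading pattern (with a stand-in function replacing dbutils.notebook.run(), which is only available on a cluster):</p>

```python
from concurrent.futures import ThreadPoolExecutor

def run_notebook(path):
    # Stand-in for dbutils.notebook.run(path, timeout_in_seconds, parameters);
    # replace the body with the real call when running on a Databricks cluster.
    return f"finished {path}"

notebook_paths = ["/Shared/child1", "/Shared/child2", "/Shared/child3"]

# Each notebook runs on its own thread; map() preserves the input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_notebook, notebook_paths))

print(results)
# ['finished /Shared/child1', 'finished /Shared/child2', 'finished /Shared/child3']
```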



<p class="">Attaching the same notebook used in this blog:</p>



<div class="wp-block-file"><a id="wp-block-file--media-3c756fc1-0a80-43f5-b292-05af9ef0e3bd" href="https://azureops.org/wp-content/uploads/2022/08/notebook_run.zip">notebook_run</a><a href="https://azureops.org/wp-content/uploads/2022/08/notebook_run.zip" class="wp-block-file__button wp-element-button" download aria-describedby="wp-block-file--media-3c756fc1-0a80-43f5-b292-05af9ef0e3bd">Download</a></div>



<p class="has-background" style="background-color:#bcefca"><strong>Pro tips:</strong><br>1. We can use Azure Data Factory to run notebooks in parallel. Refer to this <a href="https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-databricks-notebook" target="_blank" rel="noreferrer noopener">post</a> to learn more.<br>2. Jobs created using the&nbsp;dbutils.notebook API must complete within 30 days.<br>3. With the methods described in this article, we can only pass string parameters to the child notebook; objects are not allowed.<br>4. Databricks provides a free community edition where you can learn and explore Databricks. You can sign up <a href="https://community.cloud.databricks.com/login.html" target="_blank" rel="noreferrer noopener">here</a>.</p>



<h2 class="wp-block-heading">See more</h2>



<iframe loading="lazy" width="700" height="394" src="https://www.youtube.com/embed/t2h6xNVFQkc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>



<div class="wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex">
<div class="is-style-fill wp-block-button"><a class="wp-block-button__link has-white-color has-blush-light-purple-gradient-background has-text-color has-background has-link-color wp-element-button" href="https://azureops.org/product/ssis-catalog-migration-wizard-pro/" target="_blank" rel="noreferrer noopener">Download Now</a></div>
</div>
<p>The post <a href="https://azureops.org/articles/call-a-notebook-from-another-notebook-in-databricks/">Call a notebook from another notebook in Databricks</a> appeared first on <a href="https://azureops.org">AzureOps</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2624</post-id>	</item>
	</channel>
</rss>
