🔹 1. Introduction
In Databricks, we usually store tabular data in Delta tables (structured data).
But what about:
- Unstructured (images, logs, videos)
- Semi-structured (JSON, CSV, XML)
- Other structured files (Parquet, ORC)
👉 For these, Databricks introduces Volumes, which provide a governed, secure storage layer managed by Unity Catalog.
Key Requirements
- Unity Catalog enabled in your Databricks workspace.
- Databricks Runtime 13.3 LTS or above.
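A quick way to sanity-check the Unity Catalog requirement is to ask which metastore the workspace is attached to. A minimal sketch from a Python notebook cell (current_metastore() is a Unity Catalog SQL function; if it errors or returns nothing, Unity Catalog is not enabled):
# Returns the ID of the Unity Catalog metastore attached to this workspace.
spark.sql("SELECT current_metastore()").show(truncate=False)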
🔹 2. What are Volumes?
- Volumes are part of the Unity Catalog hierarchy:
Metastore → Catalog → Schema → Volume
- Like tables, Volumes live inside a schema, but they are designed for file-based data rather than rows and columns.
- Volumes are governed by Unity Catalog policies (ACLs, permissions).
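Because volumes sit inside the Unity Catalog hierarchy, access is managed with the same GRANT model used for tables. A minimal sketch, assuming a volume such as dev.bronze.managed_volume (created later in this post) and a purely illustrative group name data_engineers:
# Grant file-read and file-write access on the volume to a group.
# READ VOLUME / WRITE VOLUME are the volume-specific privileges.
spark.sql("GRANT READ VOLUME ON VOLUME dev.bronze.managed_volume TO `data_engineers`")
spark.sql("GRANT WRITE VOLUME ON VOLUME dev.bronze.managed_volume TO `data_engineers`")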
🔹 3. Types of Volumes
Just like tables, Volumes come in two flavors:
- Managed Volume
  - Data location is managed by Unity Catalog.
  - Files are stored in the default managed storage.
  - If you drop the volume → both data and metadata are deleted.
- External Volume
  - Points to an external location (e.g., Azure Data Lake, S3, GCS).
  - Requires an external location + storage credential.
  - If you drop the volume → only metadata is deleted; files remain.
🔹 4. Create External Location (for External Volume)
Before creating an External Volume, you must configure an External Location.
Step 1: Create a folder in Azure Storage
- Storage Account: adbwithdata01
- Container: data
- Folder: adb/ext_volume
Step 2: Create External Location in Databricks (UI or SQL)
Using UI:
- Go to Catalog Explorer > External Locations > Create
- Example:
  - Name → ext_volume
  - Credential → sc_catalog_storage
  - Path → abfss://data@adbwithdata01.dfs.core.windows.net/adb/ext_volume
  - Test connection → ✅ Success
Using SQL:
CREATE EXTERNAL LOCATION ext_volume
URL 'abfss://data@adbwithdata01.dfs.core.windows.net/adb/ext_volume'
WITH (STORAGE CREDENTIAL sc_catalog_storage)
COMMENT 'This is for external volume';
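To confirm the external location was registered correctly, you can describe it from a notebook. A small sketch (Python cell, reusing the names above):
# Shows the URL, storage credential, and owner of the external location.
spark.sql("DESCRIBE EXTERNAL LOCATION ext_volume").show(truncate=False)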
🔹 5. Create a Managed Volume
Let's create a managed volume in the dev.bronze schema.
CREATE VOLUME dev.bronze.managed_volume
COMMENT 'This is a managed volume';
👉 Key points:
- No LOCATION is specified → Unity Catalog decides the storage path.
- Data is stored under the metastore-managed location.
Check volume details:
DESCRIBE VOLUME dev.bronze.managed_volume;
Output shows:
- Location (metastore path)
- Type = MANAGED
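You can also list the volumes registered in the schema to confirm it is there. A small sketch from a Python cell:
# Lists all volumes in the dev.bronze schema, including managed_volume.
spark.sql("SHOW VOLUMES IN dev.bronze").show(truncate=False)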
🔹 6. Using Volumes with File Paths
When accessing volumes with dbutils.fs or %sh, you must use a special path format:
/Volumes/<catalog>/<schema>/<volume>/<subfolder>/<file>
Example:
/Volumes/dev/bronze/managed_volume/files/emp.csv
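For instance, listing the volume with dbutils.fs uses exactly this path format. A small sketch (the files subfolder is created in the next section):
# List the root of the managed volume; append subfolders as needed.
display(dbutils.fs.ls("/Volumes/dev/bronze/managed_volume/"))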
🔹 7. Example: Copy Files into Managed Volume
Step 1: Download a CSV
%sh
wget https://raw.githubusercontent.com/databricks/Spar02Hero-Datasets/main/emp.csv
ls -ltr
pwd
Assume the file is saved at /databricks/driver/emp.csv.
Step 2: Create a folder inside Volume
dbutils.fs.mkdirs("/Volumes/dev/bronze/managed_volume/files")
Step 3: Copy file into Volume
dbutils.fs.cp("file:/databricks/driver/emp.csv",
"/Volumes/dev/bronze/managed_volume/files/emp.csv")
Step 4: Query file directly
SELECT *
FROM csv.`/Volumes/dev/bronze/managed_volume/files/emp.csv`;
✅ You can now read structured and semi-structured files (CSV, JSON, Parquet) stored in your volume.
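The same file can also be read through the DataFrame API with the volume path. A minimal Python sketch (the header/schema options are assumptions about the sample CSV):
# Read the CSV from the managed volume into a DataFrame.
df = (spark.read
      .option("header", "true")        # assumes the sample file has a header row
      .option("inferSchema", "true")
      .csv("/Volumes/dev/bronze/managed_volume/files/emp.csv"))
df.show()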
🔹 8. Create an External Volume
Now letβs create an external volume that points to the external location we created earlier.
CREATE EXTERNAL VOLUME dev.bronze.external_volume
LOCATION 'abfss://data@adbwithdata01.dfs.core.windows.net/adb/ext_volume'
COMMENT 'External volume for semi/unstructured data';
Check details:
DESCRIBE VOLUME dev.bronze.external_volume;
- Type = EXTERNAL
- Location = Azure path provided
Step 1: Create a folder inside external volume
dbutils.fs.mkdirs("/Volumes/dev/bronze/external_volume/files")
Step 2: Copy file into external volume
dbutils.fs.cp("file:/databricks/driver/emp.csv",
"/Volumes/dev/bronze/external_volume/files/emp.csv")
Step 3: Verify in Azure Portal
- Navigate to adb/ext_volume/files/emp.csv
- The file is now available outside Databricks too.
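Since the external volume is intended for semi-/unstructured data, you can also write files into it directly. A small sketch using dbutils.fs.put (the file name and contents are purely illustrative):
# Write a small log file straight into the external volume (last argument = overwrite).
dbutils.fs.put(
    "/Volumes/dev/bronze/external_volume/files/app.log",
    "2024-01-01 10:00:00 INFO application started\n",
    True
)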
🔹 9. Drop a Volume
- Managed Volume → drops data + metadata.
- External Volume → drops only metadata; files remain in storage.
Example:
-- Drop external volume
DROP VOLUME dev.bronze.external_volume;
-- Files still exist in Azure
If you recreate the volume pointing to the same location:
CREATE EXTERNAL VOLUME dev.bronze.external_volume
LOCATION 'abfss://data@adbwithdata01.dfs.core.windows.net/adb/ext_volume';
👉 Files reappear inside Databricks.
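A quick check that the existing files are queryable again through the recreated volume (a sketch, reusing the paths from earlier):
# The previously copied emp.csv is immediately readable again.
spark.sql("SELECT * FROM csv.`/Volumes/dev/bronze/external_volume/files/emp.csv`").show()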
🔹 10. Summary
- Volumes allow Databricks to govern files (structured/unstructured) under Unity Catalog.
- Managed Volume → fully controlled by Databricks; data is removed on drop.
- External Volume → points to external storage; dropping only removes metadata.
- File access is always via /Volumes/<catalog>/<schema>/<volume>/...
- You can read, write, copy, and query files in volumes with SQL or dbutils.