The main file storage options in Databricks are:
- Unity Catalog Volumes: Recommended for storing structured, semi-structured, and unstructured data, libraries, build artifacts, and configuration files. Offers robust governance, fine-grained access control, cross-workspace accessibility, and direct cloud storage integration (S3, Azure ADLS, GCS). Suitable for large files and supports audit logging.
- Workspace Files: Intended for notebooks, SQL queries, source code files, and small project data files (usually <500MB). Access and permissions are limited to a single workspace. Useful for temporary or development artifacts; supports Git folder integration for version control.
- Databricks File System (DBFS): A distributed file system abstraction layered over cloud object storage. Provides a unified, Unix-like interface for all clusters and holds files in directories such as `/FileStore`, `/databricks-datasets`, and `/user/hive/warehouse`. DBFS is not recommended for new workflows due to its limited security controls (all workspace users have access) and governance features.
- Direct Cloud Object Storage Access: Use native protocols (such as `abfss://` for Azure, `s3://` for AWS, and `gs://` for Google Cloud) to read/write files directly in object stores, usually governed via Unity Catalog external locations.
- External Locations (via Unity Catalog): Securely register cloud storage locations for creating and governing external tables and file access. Best practice for production systems needing strong security and compliance.
- Mount Points (`/mnt`, legacy): Older method of mounting external storage (e.g., S3 buckets, ADLS containers) into the DBFS namespace. Deprecated in favor of Unity Catalog volumes and direct access.
| Option | Best Use Case | Security/Governance | Notes |
|---|---|---|---|
| Unity Catalog Volumes | Data, artifacts across workspaces | Strong | Recommended, scalable |
| Workspace Files | Notebooks, code, small files | Workspace ACLs | Limited to one workspace |
| DBFS Root & Folders | Legacy, temp, example datasets | Basic | Not recommended for prod |
| Direct Cloud Storage (abfss/s3/gs) | High-performance, large datasets | Governed by UC | Preferred for new workloads |
| External Locations | Tables/files on cloud storage | Strong (via UC) | Full audit, compliance |
| Mount Points | Legacy scripts, migration | Basic | Deprecated |
For new and production-grade workloads, prefer Unity Catalog volumes, external locations, or direct cloud storage access; use workspace files for development and temporary needs. Avoid DBFS root and mount points for sensitive or critical data.
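As a quick illustration of that guidance, here is a minimal sketch of copying a legacy file out of the DBFS root into a Unity Catalog volume; the file name and the catalog/schema/volume names are placeholders.

```python
# Minimal sketch: move a legacy file from the DBFS root into a UC volume.
# "legacy_report.csv" and the catalog/schema/volume names are placeholders.
dbutils.fs.cp(
    "dbfs:/FileStore/legacy_report.csv",
    "/Volumes/my_catalog/my_schema/my_volume/legacy_report.csv"
)
```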
Example of Each File Storage Option in Databricks
Here are practical examples for each main Databricks file storage option, demonstrating how you’d store, access, or manage files using these systems.
1. Unity Catalog Volumes
Create a volume with SQL, then write a file to it with Python:
```sql
-- Create a Unity Catalog volume (requires the CREATE VOLUME privilege)
CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.my_volume
COMMENT 'Example volume';
```
```python
# Write to the volume from a notebook
with open('/Volumes/my_catalog/my_schema/my_volume/example.txt', 'w') as f:
    f.write('Unity Catalog Volume Example')
```
- Access File: `/Volumes/my_catalog/my_schema/my_volume/example.txt`
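To confirm the write, the volume can be browsed and the file read back using the same paths; a minimal sketch with the `dbutils.fs` utilities and standard Python file APIs, reusing the names from the example above.

```python
# List the volume's contents, then read the file back.
display(dbutils.fs.ls("/Volumes/my_catalog/my_schema/my_volume/"))

with open('/Volumes/my_catalog/my_schema/my_volume/example.txt', 'r') as f:
    print(f.read())
```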
2. Workspace Files
Upload or create a small file in the workspace (notebook or UI):
- In the Databricks UI, go to Workspace > Files and upload `demo.txt`.
- Access File: use in notebooks as `/Workspace/Files/demo.txt`
```python
# Read from a workspace file in a notebook
with open('/Workspace/Files/demo.txt', 'r') as f:
    print(f.read())
```
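Because workspace files are exposed on the driver's local filesystem (on recent Databricks Runtime versions), standard Python modules such as `os` also work; a small sketch, assuming the `Files` folder and `demo.txt` upload from the example above.

```python
import os

# Workspace files appear under /Workspace on the driver's local filesystem,
# so ordinary Python file APIs can inspect them.
print(os.listdir('/Workspace/Files'))
print(os.path.getsize('/Workspace/Files/demo.txt'))
```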
3. Databricks File System (DBFS)
Store and read a file in DBFS:
```python
# Save a file to DBFS (e.g., under /FileStore); True overwrites an existing file
dbutils.fs.put("/FileStore/my_example.txt", "DBFS example data", True)

# Read the file back
print(dbutils.fs.head("/FileStore/my_example.txt"))
```
- Access File: `dbfs:/FileStore/my_example.txt`
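DBFS paths can also be browsed with the `dbutils.fs` utilities; the sketch below lists the bundled sample datasets and previews one file (the `README.md` path is an assumption about the sample-data layout).

```python
# Browse the sample datasets that Databricks exposes under DBFS.
display(dbutils.fs.ls("/databricks-datasets/"))

# Preview the first bytes of a file via the dbfs:/ scheme.
print(dbutils.fs.head("dbfs:/databricks-datasets/README.md"))
```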
4. Direct Cloud Object Storage Access (abfss, s3, gs)
Read a file directly from Azure Data Lake Storage Gen2 (example for abfss):
```python
# Load a CSV directly from ADLS Gen2
df = spark.read.csv("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/mydata/myfile.csv")
df.show()
```
- Access File: `abfss://...`, `s3://...`, or `gs://...`
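Writes work the same way, addressed by URI. A minimal sketch that saves the DataFrame back to the same container as a Delta table; the container, account, and folder names are placeholders, and access is assumed to be granted through Unity Catalog or cluster credentials.

```python
# Write the DataFrame back to object storage as Delta, addressed by URI.
(
    df.write
      .format("delta")
      .mode("overwrite")
      .save("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/mydata/cleaned/")
)
```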
5. External Locations (Unity Catalog)
Create an external location, then create a table from it:
```sql
-- Register your external location (once an admin sets up a storage credential)
CREATE EXTERNAL LOCATION my_ext_loc
URL 'abfss://container@account.dfs.core.windows.net/folder/'
WITH (STORAGE CREDENTIAL my_credential);

-- Create an external table using the registered location
CREATE TABLE my_catalog.my_schema.ext_table
LOCATION 'abfss://container@account.dfs.core.windows.net/folder/data/';
```
- Access Table: Governed by Unity Catalog, referencing external cloud storage.
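Once the location is registered, its URI can also be browsed from a notebook, and file access can be granted with standard Unity Catalog SQL; a sketch reusing the names above, with `data_engineers` as a placeholder group name.

```python
# Browse the governed URI directly; Unity Catalog enforces the permissions.
display(dbutils.fs.ls("abfss://container@account.dfs.core.windows.net/folder/"))

# Grant read access on the external location to a group (placeholder name).
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION my_ext_loc TO `data_engineers`")
```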
6. Mount Points (/mnt, legacy)
(Deprecated; not for new projects, but still seen in older scripts)
```python
# Mount external storage into the DBFS namespace (older pattern)
dbutils.fs.mount(
    source="wasbs://container@account.blob.core.windows.net/",
    mount_point="/mnt/my_mount",
    extra_configs={"fs.azure.account.key.account.blob.core.windows.net": "key"}
)

# Access files through the mount
dbutils.fs.ls("/mnt/my_mount/data/")
```
- Access File: `/mnt/my_mount/data/`
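When migrating away from mounts, existing mounts can be inspected and removed from a notebook; a minimal sketch using the `dbutils.fs` mount utilities, assuming the `/mnt/my_mount` example above.

```python
# Inspect the mounts defined in this workspace, then detach the legacy one.
display(dbutils.fs.mounts())
dbutils.fs.unmount("/mnt/my_mount")
```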
Summary Table: Examples
| Storage Option | Example Path/Usage | Code/SQL Example |
|---|---|---|
| Unity Catalog Volume | `/Volumes/my_catalog/my_schema/my_volume/file` | Create volume (SQL), Python |
| Workspace Files | `/Workspace/Files/demo.txt` | Python |
| DBFS | `dbfs:/FileStore/my_example.txt` | `dbutils.fs` API |
| Direct Cloud Storage | `abfss://container@account/...`, `s3://...` | `spark.read`, SQL |
| External Locations (UC) | Registered cloud path in Unity Catalog | `CREATE EXTERNAL LOCATION` |
| Mount Points (`/mnt`) | `/mnt/my_mount/data/` | `dbutils.fs.mount` |
Each storage solution fits distinct needs for governance, sharing, scalability, and compatibility in Databricks workflows.