Databricks: File Storage Options on Databricks

The main file storage options in Databricks are:

  • Unity Catalog Volumes: Recommended for storing structured, semi-structured, and unstructured data, libraries, build artifacts, and configuration files. Offers robust governance, fine-grained access control, cross-workspace accessibility, and direct cloud storage integration (S3, Azure ADLS, GCS). Suitable for large files and supports audit logging.
  • Workspace Files: Intended for notebooks, SQL queries, source code files, and small project data files (usually <500MB). Access and permissions are limited to a single workspace. Useful for temporary or development artifacts; supports Git folder integration for version control.
  • Databricks File System (DBFS): Distributed file system abstraction layered over cloud object storage. Provides a unified, Unix-like interface for all clusters; holds files in directories such as /FileStore, /databricks-datasets, and /user/hive/warehouse. Not recommended for new workloads: the DBFS root offers limited security controls (all workspace users can access it) and lacks Unity Catalog governance.
  • Direct Cloud Object Storage Access: Use native protocols (such as abfss:// for Azure, s3:// for AWS, gs:// for Google Cloud) to read/write files directly in object stores—usually governed via Unity Catalog external locations.
  • External Locations (via Unity Catalog): Securely register cloud storage locations for creating and governing external tables and file access. Best practice for production systems needing strong security and compliance.
  • Mount Points (/mnt, legacy): Old method of mounting external storage into the DBFS namespace (e.g., S3 buckets, ADLS containers). Deprecated in favor of Unity Catalog volumes and direct access.

| Option | Best Use Case | Security/Governance | Notes |
|---|---|---|---|
| Unity Catalog Volumes | Data, artifacts across workspaces | Strong | Recommended, scalable |
| Workspace Files | Notebooks, code, small files | Workspace ACLs | Limited to one workspace |
| DBFS Root & Folders | Legacy, temp, example datasets | Basic | Not recommended for prod |
| Direct Cloud Storage (abfss/s3/gs) | High-performance, large datasets | Governed by UC | Preferred for new workloads |
| External Locations | Tables/files on cloud storage | Strong (via UC) | Full audit, compliance |
| Mount Points | Legacy scripts, migration | Basic | Deprecated |

For new and production-grade workloads, prefer Unity Catalog volumes, external locations, or direct cloud storage access; use workspace files for development and temporary needs. Avoid DBFS root and mount points for sensitive or critical data.

Examples of Each File Storage Option in Databricks

Here are practical examples for each main Databricks file storage option, demonstrating how you’d store, access, or manage files using these systems.


1. Unity Catalog Volumes

Create a volume with SQL, then write a file to it from Python:

```sql
-- Create a Unity Catalog volume (requires the CREATE VOLUME privilege on the schema)
CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.my_volume
COMMENT 'Example volume';
```

```python
# Write a file to the volume from a notebook using the standard Python file API
with open('/Volumes/my_catalog/my_schema/my_volume/example.txt', 'w') as f:
    f.write('Unity Catalog Volume Example')
```
  • Access File: /Volumes/my_catalog/my_schema/my_volume/example.txt
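
To confirm the write, the file can be listed and read back with dbutils.fs or plain Python file APIs. A minimal sketch, assuming the volume created above exists and you hold read access on it:

```python
# List the volume contents, then read the example file back (paths reuse the volume above)
files = dbutils.fs.ls("/Volumes/my_catalog/my_schema/my_volume/")
print([f.name for f in files])

with open('/Volumes/my_catalog/my_schema/my_volume/example.txt', 'r') as f:
    print(f.read())
```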

2. Workspace Files

Upload or create a small file in the workspace (notebook or UI):

  • In the Databricks UI, open Workspace, navigate to a folder (for example, your user folder), and upload or create demo.txt.
  • Access File: Use the file's full workspace path in notebooks, e.g. /Workspace/Users/<your-user>/demo.txt

```python
# Read a workspace file in a notebook (substitute the file's actual workspace path)
with open('/Workspace/Users/<your-user>/demo.txt', 'r') as f:
    print(f.read())
```
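
If demo.txt is stored in the same workspace folder as the notebook, a relative path also works, because a notebook's working directory defaults to its own workspace folder on current Databricks Runtime versions. A minimal sketch, assuming that folder layout:

```python
import os

# The notebook's working directory is its containing workspace folder,
# so a file stored next to the notebook can be read with a relative path.
print(os.getcwd())

with open('demo.txt', 'r') as f:  # assumes demo.txt sits beside this notebook
    print(f.read())
```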

3. Databricks File System (DBFS)

Store and read a file in DBFS:

```python
# Save a small text file to DBFS (e.g. under /FileStore); the final argument overwrites any existing file
dbutils.fs.put("/FileStore/my_example.txt", "DBFS example data", True)

# Read the beginning of the file back
display(dbutils.fs.head('/FileStore/my_example.txt'))
```
  • Access File: dbfs:/FileStore/my_example.txt
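
The same file is also visible to Spark; dbfs:/ is the default scheme, so /FileStore/... and dbfs:/FileStore/... refer to the same object. A minimal sketch:

```python
# Read the DBFS file with Spark as a single-column text DataFrame
df = spark.read.text("dbfs:/FileStore/my_example.txt")
df.show(truncate=False)
```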

4. Direct Cloud Object Storage Access (abfss, s3, gs)

Read a file directly from Azure Data Lake Storage Gen2 (example for abfss):

```python
# Load a CSV directly from ADLS Gen2
df = spark.read.csv("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/mydata/myfile.csv")
df.show()
```
  • Access File: abfss://..., s3://..., or gs://...
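
Writes work the same way, provided the path is covered by a Unity Catalog external location (or other cluster credentials) that grants write access. A minimal sketch reusing the placeholder container and storage account above, with a hypothetical delta_out/ target folder:

```python
# Read the CSV with a header row, then write the result back to cloud storage as Delta
df = (spark.read
      .option("header", True)
      .csv("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/mydata/myfile.csv"))

(df.write
   .format("delta")
   .mode("overwrite")
   .save("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/mydata/delta_out/"))
```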

5. External Locations (Unity Catalog)

Create an external location, then create a table from it:

```sql
-- Register the external location (a storage credential must already be set up by an admin)
CREATE EXTERNAL LOCATION my_ext_loc
  URL 'abfss://container@account.dfs.core.windows.net/folder/'
  WITH (STORAGE CREDENTIAL my_credential);

-- Create an external table on the registered location
-- (works as-is if the path holds existing Delta data; otherwise specify columns and a format)
CREATE TABLE my_catalog.my_schema.ext_table
LOCATION 'abfss://container@account.dfs.core.windows.net/folder/data/';
```
  • Access Table: Governed by Unity Catalog, referencing external cloud storage.
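
Access to the registered path can then be granted and browsed through Unity Catalog. A minimal sketch from a Python notebook, assuming a hypothetical data_readers group and the placeholder location above:

```python
# Grant read access on the external location (requires owner or admin privileges on it)
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION my_ext_loc TO `data_readers`")

# Browse the governed path directly from a Unity Catalog-enabled cluster
display(dbutils.fs.ls("abfss://container@account.dfs.core.windows.net/folder/"))
```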

6. Mount Points (/mnt, legacy)

(Deprecated; not for new projects, but still seen in older scripts)

```python
# Mount external storage into the DBFS namespace (older pattern)
dbutils.fs.mount(
  source = "wasbs://container@account.blob.core.windows.net/",
  mount_point = "/mnt/my_mount",
  extra_configs = {"fs.azure.account.key.account.blob.core.windows.net": "key"}
)

# Access files through the mount point
dbutils.fs.ls("/mnt/my_mount/data/")
```
  • Access File: /mnt/my_mount/data/
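
When migrating away from mounts, it helps to inventory what is mounted and remove each mount once its workloads point at Unity Catalog volumes or direct cloud paths instead. A minimal sketch:

```python
# Inspect existing mounts: each entry pairs a mount point with its cloud source
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)

# Remove the legacy mount once nothing depends on it
dbutils.fs.unmount("/mnt/my_mount")
```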

Summary Table: Examples

| Storage Option | Example Path/Usage | Code/SQL Example |
|---|---|---|
| Unity Catalog Volume | /Volumes/my_catalog/my_schema/my_volume/file | Create volume, Python |
| Workspace Files | /Workspace/Users/<user>/demo.txt | Python |
| DBFS | dbfs:/FileStore/my_example.txt | dbutils.fs API |
| Direct Cloud Storage | abfss://container@account/..., s3://... | spark.read, SQL |
| External Locations (UC) | Registered cloud path in Unity Catalog | CREATE EXTERNAL LOCATION |
| Mount Points (/mnt) | /mnt/my_mount/data/ | dbutils.fs.mount |

Each storage solution fits distinct needs for governance, sharing, scalability, and compatibility in Databricks workflows.
