{"id":423,"date":"2025-08-11T15:00:00","date_gmt":"2025-08-11T15:00:00","guid":{"rendered":"https:\/\/dataopsschool.com\/blog\/?p=423"},"modified":"2025-08-11T15:00:02","modified_gmt":"2025-08-11T15:00:02","slug":"databricks-file-storage-options-on-databricks","status":"publish","type":"post","link":"https:\/\/dataopsschool.com\/blog\/databricks-file-storage-options-on-databricks\/","title":{"rendered":"Databricks: File Storage Options on Databricks"},"content":{"rendered":"\n<p>The main file storage options in Databricks are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Unity Catalog Volumes:<\/strong> Recommended for storing structured, semi-structured, and unstructured data, libraries, build artifacts, and configuration files. Offers robust governance, fine-grained access control, cross-workspace accessibility, and direct cloud storage integration (S3, Azure ADLS, GCS). Suitable for large files and supports audit logging.<\/li>\n\n\n\n<li><strong>Workspace Files:<\/strong> Intended for notebooks, SQL queries, source code files, and small project data files (usually &lt;500MB). Access and permissions are limited to a single workspace. Useful for temporary or development artifacts; supports Git folder integration for version control.<\/li>\n\n\n\n<li><strong>Databricks File System (DBFS):<\/strong> Distributed file system abstraction layered over cloud object storage. Provides a unified, Unix-like interface for all clusters; holds files in directories such as <code>\/FileStore<\/code>, <code>\/databricks-datasets<\/code>, and <code>\/user\/hive\/warehouse<\/code>. DBFS is not recommended for new workflows due to limited security controls (all workspace users have access) and governance features.<\/li>\n\n\n\n<li><strong>Direct Cloud Object Storage Access:<\/strong> Use native protocols (such as <code>abfss:\/\/<\/code> for Azure, <code>s3:\/\/<\/code> for AWS, <code>gs:\/\/<\/code> for Google Cloud) to read\/write files directly in object stores\u2014usually governed via Unity Catalog external locations.<\/li>\n\n\n\n<li><strong>External Locations (via Unity Catalog):<\/strong> Securely register cloud storage locations for creating and governing external tables and file access. Best practice for production systems needing strong security and compliance.<\/li>\n\n\n\n<li><strong>Mount Points (<code>\/mnt<\/code>, legacy):<\/strong> Old method of mounting external storage into the DBFS namespace (e.g., S3 buckets, ADLS containers). Deprecated in favor of Unity Catalog volumes and direct access.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Option<\/th><th>Best Use Case<\/th><th>Security\/Governance<\/th><th>Notes<\/th><\/tr><\/thead><tbody><tr><td><strong>Unity Catalog Volumes<\/strong><\/td><td>Data, artifacts across workspaces<\/td><td>Strong<\/td><td>Recommended, scalable<\/td><\/tr><tr><td><strong>Workspace Files<\/strong><\/td><td>Notebooks, code, small files<\/td><td>Workspace ACLs<\/td><td>Limited to one workspace<\/td><\/tr><tr><td><strong>DBFS Root &amp; Folders<\/strong><\/td><td>Legacy, temp, example datasets<\/td><td>Basic<\/td><td>Not recommended for prod<\/td><\/tr><tr><td><strong>Direct Cloud Storage (abfss\/s3\/gs)<\/strong><\/td><td>High-performance, large datasets<\/td><td>Governed by UC<\/td><td>Preferred for new workloads<\/td><\/tr><tr><td><strong>External Locations<\/strong><\/td><td>Tables\/files on cloud storage<\/td><td>Strong (via UC)<\/td><td>Full audit, compliance<\/td><\/tr><tr><td><strong>Mount Points<\/strong><\/td><td>Legacy scripts, migration<\/td><td>Basic<\/td><td>Deprecated<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>For new and production-grade workloads, prefer <strong>Unity Catalog volumes<\/strong>, <strong>external locations<\/strong>, or <strong>direct cloud storage access<\/strong>; use <strong>workspace files<\/strong> for development and temporary needs. Avoid DBFS root and mount points for sensitive or critical data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"example-of-each-file-storage-option-in-databricks\">Example of Each File Storage Option in Databricks<\/h2>\n\n\n\n<p>Here are practical examples for each main Databricks file storage option, demonstrating how you&#8217;d store, access, or manage files using these systems.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">1. <strong>Unity Catalog Volumes<\/strong><\/h2>\n\n\n\n<p>Create a volume and write a file to it with Python:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">python<code><em># Create a Unity Catalog volume via SQL (admin required)<\/em>\nCREATE VOLUME IF NOT EXISTS my_catalog.my_schema.my_volume \nCOMMENT 'Example volume';\n\n<em># Write to the volume in a notebook<\/em>\nwith open('\/Volumes\/my_catalog\/my_schema\/my_volume\/example.txt', 'w') as f:\n    f.write('Unity Catalog Volume Example')\n<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Access File:<\/strong> <code>\/Volumes\/my_catalog\/my_schema\/my_volume\/example.txt<\/code><\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">2. <strong>Workspace Files<\/strong><\/h2>\n\n\n\n<p>Upload or create a small file in the workspace (notebook or UI):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>In the Databricks UI, go to <strong>Workspace > Files<\/strong> and upload <code>demo.txt<\/code>.<\/li>\n\n\n\n<li><strong>Access File:<\/strong> Use in notebooks as <code>\/Workspace\/Files\/demo.txt<\/code><\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">python<code><em># Read from workspace file in a notebook<\/em>\nwith open('\/Workspace\/Files\/demo.txt', 'r') as f:\n    print(f.read())\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">3. <strong>Databricks File System (DBFS)<\/strong><\/h2>\n\n\n\n<p>Store and read a file in DBFS:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">python<code><em># Save a file to DBFS (e.g., \/FileStore)<\/em>\ndbutils.fs.put(\"\/FileStore\/my_example.txt\", \"DBFS example data\", True)\n\n<em># Read the file back<\/em>\ndisplay(dbutils.fs.head('\/FileStore\/my_example.txt'))\n<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Access File:<\/strong> <code>dbfs:\/FileStore\/my_example.txt<\/code><\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">4. <strong>Direct Cloud Object Storage Access (abfss, s3, gs)<\/strong><\/h2>\n\n\n\n<p>Read a file directly from Azure Data Lake Storage Gen2 (example for abfss):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">python<code><em># Load a CSV directly from ADLS Gen2<\/em>\ndf = spark.read.csv(\"abfss:\/\/mycontainer@mystorageaccount.dfs.core.windows.net\/mydata\/myfile.csv\")\ndf.show()\n<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Access File:<\/strong> <code>abfss:\/\/...<\/code>, <code>s3:\/\/...<\/code>, or <code>gs:\/\/...<\/code><\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">5. <strong>External Locations (Unity Catalog)<\/strong><\/h2>\n\n\n\n<p>Create an external location, then create a table from it:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">sql<code><em>-- Register your external location (once admin sets up storage credential)<\/em>\nCREATE EXTERNAL LOCATION my_ext_loc\n  URL 'abfss:\/\/container@account.dfs.core.windows.net\/folder\/'\n  WITH (STORAGE CREDENTIAL my_credential);\n\n<em>-- Create an external table using the registered location<\/em>\nCREATE TABLE my_catalog.my_schema.ext_table\nLOCATION 'abfss:\/\/container@account.dfs.core.windows.net\/folder\/data\/';\n<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Access Table:<\/strong> Governed by Unity Catalog, referencing external cloud storage.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">6. <strong>Mount Points (\/mnt, legacy)<\/strong><\/h2>\n\n\n\n<p>(Deprecated; not for new projects, but still seen in older scripts)<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">python<code><em># Mount an external storage (older pattern)<\/em>\ndbutils.fs.mount(\n  source = \"wasbs:\/\/container@account.blob.core.windows.net\/\",\n  mount_point = \"\/mnt\/my_mount\",\n  extra_configs = {\"fs.azure.account.key.account.blob.core.windows.net\": \"key\"}\n)\n\n<em># Access file from mount<\/em>\ndbutils.fs.ls(\"\/mnt\/my_mount\/data\/\")\n<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Access File:<\/strong> <code>\/mnt\/my_mount\/data\/<\/code><\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"summary-table-examples\">Summary Table: Examples<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Storage Option<\/th><th>Example Path\/Usage<\/th><th>Code\/SQL Example<\/th><\/tr><\/thead><tbody><tr><td>Unity Catalog Volume<\/td><td><code>\/Volumes\/my_catalog\/my_schema\/my_volume\/file<\/code><\/td><td>Create volume, Python<\/td><\/tr><tr><td>Workspace Files<\/td><td><code>\/Workspace\/Files\/demo.txt<\/code><\/td><td>Python<\/td><\/tr><tr><td>DBFS<\/td><td><code>dbfs:\/FileStore\/my_example.txt<\/code><\/td><td>dbutils.fs API<\/td><\/tr><tr><td>Direct Cloud Storage<\/td><td><code>abfss:\/\/container@account\/...<\/code>, <code>s3:\/\/...<\/code><\/td><td>spark.read, SQL<\/td><\/tr><tr><td>External Locations (UC)<\/td><td>Registered cloud path in Unity Catalog<\/td><td>CREATE EXTERNAL LOCATION<\/td><\/tr><tr><td>Mount Points (\/mnt)<\/td><td><code>\/mnt\/my_mount\/data\/<\/code><\/td><td>dbutils.fs.mount<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Each storage solution fits distinct needs for governance, sharing, scalability, and compatibility in Databricks workflows.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The main file storage options in Databricks are: Option Best Use Case Security\/Governance Notes Unity Catalog Volumes Data, artifacts across workspaces Strong Recommended, scalable Workspace Files Notebooks,&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-423","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/423","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/comments?post=423"}],"version-history":[{"count":1,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/423\/revisions"}],"predecessor-version":[{"id":424,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/423\/revisions\/424"}],"wp:attachment":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/media?parent=423"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/categories?post=423"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/tags?post=423"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}