Storage options for Cloud TPU data
This document describes data storage options that can be used when training models on Cloud TPU.
Introduction
Cloud TPU requires data storage for:
- Dataset downloading and preprocessing
- Host input pipeline processing
- Model training input
- Model training output
The storage options for the Cloud TPU application data and training datasets are:
- Durable block storage, including the boot disk and attached storage disks
- Cloud Storage buckets
- Cloud Storage FUSE
- Filestore file share on a Compute Engine VM
Durable block storage
Durable block storage, also known as disks or volumes, is for data that you want to preserve after you stop, suspend, or delete your TPU VM. Durable block storage is still available even if the TPU VM crashes or fails. You can use the TPU VM boot disk or attach additional block storage to your TPU.
You might want to attach an additional disk in the following scenarios:
- The size of your training dataset exceeds the size of the TPU boot disk.
- You have read-only data and want faster read access using a Hyperdisk ML volume.
You can attach two types of durable block storage to a Cloud TPU: Google Cloud Hyperdisk and Persistent Disk. Persistent Disk is not supported for the latest machine series, including Cloud TPU v6e. Google recommends using Google Cloud Hyperdisk for the highest performance and advanced features.
TPU VM boot disk
By default, each Cloud TPU VM has a single 100 GiB boot disk that contains the operating system. The boot disk can also be used for temporary storage of downloaded datasets for preprocessing and model input and output data, as long as the total amount doesn't exceed the available space on the boot disk.
You can't resize the boot disk on a Cloud TPU. If your application requires additional storage space beyond the boot disk default, you can add one or more durable disks to your TPU VM instance. For more information, see Attach durable block storage to a TPU VM.
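As a rough illustration of the space constraint described above, the following Python sketch checks whether a dataset of a known size would fit on the disk that backs a given path before you download it. The function name `fits_on_disk`, the `reserve_bytes` safety margin, and the example sizes are assumptions made for this example; they are not part of any Cloud TPU API.

```python
import shutil

GIB = 1024 ** 3


def fits_on_disk(dataset_bytes: int, path: str = "/", reserve_bytes: int = 10 * GIB) -> bool:
    """Return True if dataset_bytes fits on the disk that backs path.

    reserve_bytes is a hypothetical safety margin kept free for the
    operating system, logs, and model output.
    """
    free_bytes = shutil.disk_usage(path).free
    return dataset_bytes + reserve_bytes <= free_bytes


if __name__ == "__main__":
    # Example: would a 60 GiB dataset fit on the disk mounted at "/"?
    print(fits_on_disk(60 * GIB))
```

If the check fails, that is a signal to attach an additional disk or to read the dataset from another storage option instead.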
Attached storage
Both Hyperdisk and Persistent Disk are durable network storage devices that your VM instances can access like physical disks in a desktop or a server. Both types of disks are created independently from your virtual machine (VM) instances, so you can keep your data even after you delete your VM instances.
Advantages of using Hyperdisk over Persistent Disk include customizable performance and higher IOPS and throughput limits. For more information about Hyperdisk and Persistent Disk, see Choose a disk type.
For more information about using durable block storage with TPU VMs, see Attach durable block storage to a TPU VM.
Disk backups
If the TPU VM gets stuck in an "unknown" state, it can be difficult to retrieve data from the boot disk. Deleted data can also be difficult to recover. Make sure to back up your data using another storage option, such as Cloud Storage buckets.
If you store data on an attached disk, you can use disk snapshots, which incrementally back up data on a disk. Disk snapshots aren't supported for the TPU boot disk. For more information, see About disk snapshots.
Cloud Storage buckets
Cloud Storage buckets are the most flexible, scalable, and durable storage option for your VM instances. If your training job does not require the lower latency of durable block storage, you can store your dataset in a Cloud Storage bucket.
The performance of Cloud Storage buckets depends on the storage class that you select and the location of the bucket relative to your instance.
Creating your Cloud Storage bucket in the same region as your TPU VM gives performance that is comparable to durable block storage, but with higher latency and less consistent throughput.
All Cloud Storage buckets have built-in redundancy to protect your data against equipment failure and to ensure data availability through data center maintenance events. Checksums are calculated for all Cloud Storage operations to help ensure that what you read is what you wrote.
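If you also want to confirm that a downloaded copy matches the stored object, one option is to compare a locally computed hash with the hash in the object's metadata. The sketch below computes a base64-encoded MD5 digest, which is the encoding Cloud Storage uses when it reports an object's MD5 hash. The helper name `md5_base64` and the chunk size are illustrative assumptions. Some objects, such as composite objects, have only a CRC32C checksum, which requires a third-party library and is not shown here.

```python
import base64
import hashlib


def md5_base64(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the base64-encoded MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large dataset files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")
```

You can then compare the returned string with the MD5 value shown in the object's metadata. A mismatch suggests a corrupted or incomplete download.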
Unlike durable block storage, Cloud Storage buckets are not restricted to the zone where your instance is located. Additionally, you can read and write data to a bucket from multiple instances simultaneously. For example, you can configure instances in multiple zones to read and write data in the same bucket rather than replicate the data to durable block storage in multiple zones.
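When several instances read from the same bucket, a common pattern is to give each instance a disjoint subset of the objects so that no file is processed twice. The following sketch shows one simple way to do this with round-robin sharding. The function name, the worker numbering, and the object names are hypothetical; the list of object names is assumed to come from whatever bucket listing method you already use.

```python
from typing import List, Sequence


def shard_for_worker(object_names: Sequence[str], worker_index: int, num_workers: int) -> List[str]:
    """Return the subset of object_names assigned to one worker."""
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index must be in the range [0, num_workers)")
    # Sort first so every worker derives the same order, however the names were listed.
    return sorted(object_names)[worker_index::num_workers]


if __name__ == "__main__":
    names = [f"train-{i:05d}.tfrecord" for i in range(10)]  # hypothetical object names
    print(shard_for_worker(names, worker_index=1, num_workers=4))
```

Because the assignment depends only on the sorted names and the worker index, instances in different zones can compute their shards independently without coordinating with each other.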
For more information about connecting your TPU VM to a Cloud Storage bucket, see Connecting to Cloud Storage buckets.
Cloud Storage FUSE
Cloud Storage FUSE lets you mount and access Cloud Storage buckets as local file systems. This allows applications to read and write objects in your bucket using standard file system semantics.
For details about how Cloud Storage FUSE works and how its operations map to Cloud Storage operations, see the Cloud Storage FUSE documentation. You can find additional information on GitHub, such as how to install the Cloud Storage FUSE CLI and how to mount buckets.
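Because a mounted bucket looks like a local directory, ordinary file system code can work with it unchanged. The sketch below walks a directory tree and yields matching file paths using only the Python standard library. The mount point `/mnt/gcs/my-bucket` and the `.tfrecord` suffix are hypothetical, and the function works the same way on any local directory.

```python
import os
from typing import Iterator


def iter_files(mount_dir: str, suffix: str = "") -> Iterator[str]:
    """Yield paths of files under mount_dir whose names end with suffix."""
    for root, _dirs, files in os.walk(mount_dir):
        for name in sorted(files):
            if name.endswith(suffix):
                yield os.path.join(root, name)


if __name__ == "__main__":
    # Hypothetical mount point for a bucket mounted with Cloud Storage FUSE.
    for path in iter_files("/mnt/gcs/my-bucket", suffix=".tfrecord"):
        print(path)
```

Keep in mind that each file system call on a mounted bucket is translated into Cloud Storage operations, so listing very large directory trees can be slower than on a local disk.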
Filestore file share
A Filestore file share is fully managed network-attached storage (NAS) for Compute Engine. Filestore offers compatibility with existing enterprise applications and supports any NFSv3-compatible client.
Filestore offers low latency for file operations. For latency-sensitive workloads, Filestore supports capacity up to 100 TiB, throughput of 25 GiB per second, and 720K IOPS, with minimal variability in performance.
With Filestore, you can mount file shares on TPU VMs.
What's next
- Learn how to add durable block storage to your instance.
- Learn how to connect your instance to a Cloud Storage bucket.