Best practices for Cloud Storage

Cloud Storage is ideal for:

  • Storing and serving static content such as HTML, CSS and JS
  • Storing and retrieving a large number of files
  • Storing multi-terabyte files

Resources are the entities in Cloud Storage, including:

  • projects
  • buckets - the basic containers.
    • they contain immutable objects
    • bucket names must be globally unique.
    • names may contain dots when the name is a domain name
  • objects - individual pieces of data

Storage classes

Storage class  | Characteristics                                                        | Use cases                                           | Price (per GB per month)
multi-regional | geo-redundant                                                          | serving web content; streaming videos; mobile apps  | $0.026
regional       | data stored in a narrow geographic region                              | data analytics                                      | $0.02
nearline       | data retrieval cost; higher per-operation cost; 30-day minimum storage | backups; serving multimedia content                 | $0.01
coldline       | data retrieval cost; higher per-operation cost; 90-day minimum storage | disaster recovery; data archiving                   | $0.007

Within a single bucket you can mix either regional or multi-regional objects with nearline and coldline objects, but regional and multi-regional cannot be combined in the same bucket.

Bucket and/or Object operations

Strongly consistent: when you perform an operation in Cloud Storage and receive a success response, the object is immediately available for download and metadata operations.

Eventually consistent: when you perform an operation, it may take some time for the operation to take effect.

Composite objects and parallel uploads

Object composition lets you combine up to 32 existing objects into a new object without transferring additional object data. This can be used for uploading an object in parallel.
Simply divide your data into multiple chunks, upload each chunk to a distinct object in parallel, compose your final object and delete the temporary objects.

Ideal for:

  • parallel uploads: divide your data, upload each chunk to a distinct object, compose the final object and delete the temporary objects
  • appending: upload the new data to a temporary object, compose it with the object you want to append to, and delete the temporary object
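The parallel-upload pattern above can be sketched as follows. This is a minimal illustration, not a real client: the `store` dict is a hypothetical stand-in for a bucket, and the in-memory join stands in for the service-side compose request. The names `chunk_ranges` and `parallel_upload` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Limit on source objects per compose request, as stated in the text above.
MAX_COMPOSE_SOURCES = 32


def chunk_ranges(total_size, num_chunks):
    """Split total_size bytes into num_chunks (start, end) ranges, end exclusive."""
    if not 1 <= num_chunks <= MAX_COMPOSE_SOURCES:
        raise ValueError("num_chunks must be between 1 and 32")
    base, extra = divmod(total_size, num_chunks)
    ranges, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        ranges.append((start, start + size))
        start += size
    return ranges


def parallel_upload(store, name, data, num_chunks=4):
    """Upload data as temporary chunk objects in parallel, compose, then clean up.

    store is a plain dict used here as a hypothetical in-memory bucket.
    """
    ranges = chunk_ranges(len(data), num_chunks)
    tmp_names = [f"{name}.tmp-{i:02d}" for i in range(len(ranges))]

    def upload(i):
        start, end = ranges[i]
        store[tmp_names[i]] = data[start:end]  # stand-in for one chunk upload

    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        list(pool.map(upload, range(len(ranges))))

    # Stand-in for the compose call: concatenate the chunks in order.
    store[name] = b"".join(store[t] for t in tmp_names)

    # Delete the temporary objects once the final object exists.
    for t in tmp_names:
        del store[t]
    return name
```

With a real client library, the two stand-ins would be replaced by actual upload and compose calls; the chunking and cleanup logic stays the same.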

Exponential backoff

Truncated exponential backoff:

  • is a standard error-handling strategy for network apps
  • retries failed requests with exponentially increasing delays between attempts, up to a maximum delay
  • should be used for all requests to Google Cloud Storage that return HTTP 5xx and 429 error codes
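A minimal sketch of the retry loop described above. The base delay, the 32-second cap, the retry count and the `request` callable (assumed to return a `(status, body)` tuple) are all illustrative assumptions, not values taken from the Cloud Storage documentation.

```python
import random
import time

# Status codes the text above says should be retried: 429 and all 5xx.
RETRYABLE = {429} | set(range(500, 600))


def backoff_delays(max_retries=5, base=1.0, max_delay=32.0, rng=random.random):
    """Yield delays of base * 2^n plus random jitter, truncated at max_delay."""
    for attempt in range(max_retries):
        yield min(base * 2 ** attempt + rng(), max_delay)


def call_with_backoff(request, max_retries=5, sleep=time.sleep, rng=random.random):
    """Call request() and retry retryable failures with truncated backoff.

    request is a hypothetical callable returning (status, body).
    """
    status, body = request()
    for delay in backoff_delays(max_retries, rng=rng):
        if status not in RETRYABLE:
            break
        sleep(delay)
        status, body = request()
    return status, body
```

The `sleep` and `rng` parameters exist only so the behaviour can be exercised without real waiting; the random jitter keeps many clients from retrying in lockstep.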

Best practices

Bucket names are global and publicly visible. If your app needs a lot of buckets, use GUIDs or an equivalent scheme to keep names globally unique. Names should conform to standard DNS naming conventions.
Don’t use personally identifiable information or IP addresses in bucket names. If you use a domain or subdomain as a bucket name, Google will ask you to verify that you own it.

Your app should have retry logic in place to handle name collisions.
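One way to combine GUID-based names with collision retries, as a sketch. The `create` callable and the `NameTaken` exception are hypothetical placeholders for whatever your client library provides, and the regular expression is a deliberately conservative assumption (3 to 63 characters, lowercase letters, digits and hyphens), not the full naming rule set.

```python
import re
import uuid

# Conservative, assumed subset of the naming rules: lowercase letters, digits
# and hyphens, 3 to 63 characters, starting and ending with a letter or digit.
_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


class NameTaken(Exception):
    """Hypothetical error raised when a bucket name already exists."""


def is_valid_name(name):
    return _NAME_RE.match(name) is not None


def make_bucket_name(prefix="app"):
    """Build a name from a prefix plus a random GUID (32 hex characters)."""
    return f"{prefix}-{uuid.uuid4().hex}"


def create_bucket_with_retry(create, prefix="app", attempts=3):
    """Try create(name) with fresh names until one is not already taken.

    create is a hypothetical callable that raises NameTaken on a collision.
    """
    for _ in range(attempts):
        name = make_bucket_name(prefix)
        try:
            create(name)
            return name
        except NameTaken:
            continue  # collision: generate a new name and try again
    raise RuntimeError(f"no free bucket name found after {attempts} attempts")
```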

Practices for Cloud Storage traffic

  • consider: operations per second; bandwidth; cache control
  • design your app to minimize spikes in traffic
  • use exponential backoff if you get an error
  • for request rates above 1000 writes/s or 5000 reads/s:
    • start with a rate below or near this threshold
    • double the request rate no faster than every 20 minutes
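The ramp-up rule above is simple arithmetic, sketched here. The function name `ramp_schedule` is made up for illustration; the 20-minute step comes from the guideline above.

```python
def ramp_schedule(start_rate, target_rate, step_minutes=20):
    """Return (minute, requests_per_second) steps, doubling once per step."""
    if start_rate <= 0:
        raise ValueError("start_rate must be positive")
    schedule = []
    minute, rate = 0, start_rate
    while rate < target_rate:
        schedule.append((minute, rate))
        minute += step_minutes
        rate *= 2
    # The final step is capped at the target rather than overshooting it.
    schedule.append((minute, target_rate))
    return schedule
```

For example, going from 1000 to 8000 writes/s takes three doublings, so the target is reached no earlier than 60 minutes after the start.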

Consider location and availability of your data for storage options

Store data in the region closest to your app’s users. Also consider region-specific compliance requirements when choosing a location.
For analytical workloads, store your data in regional buckets; compared with multi-regional buckets, this reduces network charges and improves performance.

Multi-regional and regional storage provide the best availability at a higher price. They’re good options for data that is served at a high rate with high availability.
Nearline and coldline storage are good options for infrequently accessed data, and for data that tolerates slightly lower availability.

Secure your buckets

You can control access to your buckets through:

  • IAM (Identity and Access Management) permissions, which grant access to buckets and provide bulk access to a bucket’s objects. They don’t give fine-grained control over individual objects.
  • ACLs (Access Control Lists), which grant read or write access to users for individual buckets or objects. They’re recommended only when you need fine-grained control over individual objects.
  • Signed URLs (query string authentication), which provide time-limited read or write access to an object through a URL you generate. They can be created programmatically or through gsutil.
  • Signed policy documents, which specify what can be uploaded to a bucket. They allow greater control over size, content type and other upload characteristics than signed URLs, and are meant for website owners who let visitors upload files to Google Cloud Storage. They only work with form POSTs.
  • Firebase Security Rules, which provide granular, attribute-based access control to mobile and web apps using the Firebase SDK for Cloud Storage.
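To make the signed policy document option more concrete, here is a rough sketch of building one. The field layout (`expiration` plus a `conditions` list) follows the policy document format as I understand it and should be checked against the current documentation; the signing step is omitted entirely, and the helper names are made up for this example.

```python
import base64
import json
from datetime import datetime, timedelta, timezone


def build_policy(bucket, key_prefix, content_type, max_bytes,
                 valid_for=timedelta(hours=1), now=None):
    """Build a policy restricting bucket, key prefix, content type and size.

    The condition names are assumptions based on the documented format.
    """
    now = now or datetime.now(timezone.utc)
    expiration = (now + valid_for).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "expiration": expiration,
        "conditions": [
            {"bucket": bucket},
            ["starts-with", "$key", key_prefix],
            ["eq", "$Content-Type", content_type],
            ["content-length-range", 0, max_bytes],
        ],
    }


def encode_policy(policy):
    """Base64-encode the JSON policy, the form that is later signed and posted."""
    return base64.b64encode(json.dumps(policy).encode("utf-8")).decode("ascii")
```

The encoded policy and its signature would then be sent as fields of the HTML upload form, which is why this mechanism only applies to form POSTs.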

Security best practices

  • Use HTTPS to transport data
  • Use an HTTPS library that validates server certificates. Without certificate validation, your app is vulnerable to man-in-the-middle (MITM) attacks.
  • Revoke authentication credentials for apps that no longer need access to data
  • Securely store credentials
  • Grant access to groups rather than to large numbers of individual users
  • Bucket and Object ACLs are independent of each other
  • Avoid making buckets publicly readable / writable. Once a bucket has been made publicly readable, its data can be copied to many places on the Internet. After that, it’s effectively impossible to regain read control over an object.

Uploading-data best practices

  • If you use XMLHttpRequest (XHR):
    • don’t close and re-open the connection when progress appears to stall. Doing this creates a bad positive feedback loop during times of network congestion.
    • set reasonable timeouts.
  • Make the request to create the resumable upload URL from the same region as the bucket and upload location.
  • Avoid uploading content that has both Content-Encoding: gzip and a Content-Type that is already compressed.
  • Avoid breaking transfers into smaller chunks

gsutil for Cloud Storage

  • gsutil -D generates debugging output that includes OAuth2 refresh and access tokens. Make sure to redact this information before sending the output to anyone.
  • gsutil --trace-token output includes OAuth2 tokens and the contents of any files accessed during the trace.
  • Customer-supplied encryption key information in the .boto config is security-sensitive. The proxy config is also security-sensitive, especially if your proxy setup requires a user name and password. Protect access to your .boto config file.
  • In a prod environment, use a service account for gsutil instead of individual user accounts. Service account credentials are designed for this kind of use.

Validate your data

Data can be corrupted while it’s uploaded to or downloaded from the cloud. Validate the data that you transfer to/from buckets by using either

  • CRC32c hash
    • available for all cloud storage objects
    • gsutil automatically performs integrity checks on all uploads & downloads
    • can be computed with libraries for C++, Python, Java & Ruby
  • MD5 hash
    • supported for non-composite objects
    • cannot be used for partial downloads

If your app has already calculated one of those two hashes for your object before starting the upload, you can supply it with the upload request, and the object will only be created if the computed hash and yours match.
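Both hashes can be computed locally before an upload, as in this sketch. MD5 comes from the Python standard library. The CRC32C routine is a slow, bit-by-bit reference implementation written for illustration; real apps would normally use an optimized library. The base64 encoding (big-endian byte order for the CRC) is my understanding of how Cloud Storage represents these values and should be verified against the documentation.

```python
import base64
import hashlib

# Reflected form of the Castagnoli polynomial used by CRC32C.
_CRC32C_POLY = 0x82F63B78


def crc32c(data):
    """Compute CRC32C bit by bit (simple reference version, not optimized)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (_CRC32C_POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def crc32c_b64(data):
    """Base64 of the 4-byte big-endian CRC32C (assumed wire format)."""
    return base64.b64encode(crc32c(data).to_bytes(4, "big")).decode("ascii")


def md5_b64(data):
    """Base64 of the raw 16-byte MD5 digest (assumed wire format)."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def matches(data, expected_crc32c_b64):
    """Check downloaded bytes against the CRC32C reported for the object."""
    return crc32c_b64(data) == expected_crc32c_b64
```

Because composite objects do not carry an MD5 hash, CRC32C is the safer default when either kind of object may be involved.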