Best practices for Cloud Datastore

Cloud Datastore is a fully managed, scalable NoSQL database service that you can use to store structured or semi-structured app data. It can scale from zero traffic to millions of requests per second.

If you need to execute ad hoc queries on large data sets without previously defined indexes, you should use Google BigQuery instead.

Using Cloud Datastore requires creating an App Engine application in your project.
You can query entities in Datastore with GQL, a query language that is very similar to SQL.
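
For example, a GQL query over a hypothetical Task kind might look like this (the kind and property names are illustrative, not from the source):

```
SELECT * FROM Task WHERE done = false ORDER BY created DESC LIMIT 10
```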

Ancestor queries of entity groups give you a strongly consistent view of the data. By creating entity groups, you can ensure that all related entities can be updated in a single transaction.

Cloud Datastore automatically builds indexes for individual properties in an entity. To enable more complex queries with filters on multiple properties, you can create composite indexes.

It can scale with zero downtime, but make sure to ramp up traffic gradually. A general guideline is the 500/50/5 rule: start with a base write rate of 500 writes per second, then increase it by 50% every 5 minutes. Distribute your writes across the key range.
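
As a rough sketch, the 500/50/5 schedule can be computed like this (pure Python, no Datastore calls; the numbers come directly from the rule above):

```python
def ramp_rate(minutes_elapsed: int, base_rate: int = 500) -> float:
    """Target write rate under the 500/50/5 rule: start at 500 writes/s
    and increase the rate by 50% every 5 minutes."""
    intervals = minutes_elapsed // 5
    return base_rate * 1.5 ** intervals

# Example schedule for the first 20 minutes:
for m in (0, 5, 10, 15, 20):
    print(m, ramp_rate(m))
```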

Cloud Datastore concepts and indexes

Cloud Datastore concepts

  • Data objects are called entities. Each entity is of a particular kind.
  • Entities are made up of one or more properties.
  • Each entity has a unique key composed of a namespace, the entity kind, a string or numeric ID, and an (optional) ancestor path.
  • Operations on one or more entities are called transactions.
  • You can specify ancestor path relationships between entities to create entity groups.

When you create an entity, you can specify another entity as its parent. An entity without a parent is a root entity.
An entity’s parent, the parent’s parent, and so on are its ancestors; its children, the children’s children, and so on are its descendants.
The sequence of entities from the root entity to a specific entity is the ancestor path.
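
The key structure above can be modeled with plain tuples to make the terms concrete (this is a conceptual sketch, not the client library’s API; a key path is treated as a sequence of (kind, identifier) pairs):

```python
# A key path is a tuple of (kind, id) pairs, e.g. a Task inside a TaskList.
task_key = (("TaskList", "default"), ("Task", 1))

def ancestors(key_path):
    """All proper prefixes of the path: the parent, grandparent, and so on."""
    return [key_path[:i] for i in range(len(key_path) - 1, 0, -1)]

def root(key_path):
    """The root entity's key: the first element of the path."""
    return key_path[:1]

print(ancestors(task_key))  # [(('TaskList', 'default'),)]
print(root(task_key))
```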

Cloud Datastore indexes

Datastore has 2 types of indexes: built-in and composite indexes

Built-in indexes are sufficient to perform many simple queries, such as equality-only queries and simple inequality queries. For more complex queries, an app must define composite (manual) indexes.

Without a composite index, you can still filter on several properties as long as every filter is an equality filter. Once a query combines filters with a greater-than or less-than comparison, it needs a composite index.

Best Practices

If a property will never be needed for a query, exclude it from indexes. Unnecessary indexes increase write latency.
Avoid having too many composite indexes; each additional index adds latency and storage cost.
Do not index properties with monotonically increasing values, such as timestamps or sequential IDs. This leads to hotspots for apps with high read and write rates.

Create and delete a composite index

Composite indexes are defined in the app index config file index.yaml. They are viewable but not editable through the GCP console.

To create one

  1. modify the index.yaml config file to include all properties to be indexed
  2. run gcloud datastore create-indexes
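
For example, a composite index covering an equality filter on one property plus a descending sort on another for a hypothetical Task kind would be declared like this in index.yaml (the kind and property names are illustrative):

```yaml
indexes:
- kind: Task
  properties:
  - name: done
  - name: created
    direction: desc
```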

To delete one

  1. modify index.yaml to remove the indexes you no longer need
  2. when you’re sure they are no longer needed, run gcloud datastore cleanup-indexes

Creating an index takes some time. Queries that require the index will raise an exception until the build completes.

When you change or remove an index definition, the original index is not deleted automatically. When you’re sure that the old indexes are no longer needed, run the cleanup command to delete all indexes from the production instance that are not mentioned in the config file.

Cloud Datastore vs RDBMS

Concept                        Cloud Datastore   RDBMS
Category of an object          Kind              Table
One object                     Entity            Row
Individual data for an object  Property          Field
Unique ID for an object        Key               Primary key

The key difference is that Cloud Datastore is designed to scale automatically to very large data sets. Writes scale because Datastore maintains high performance by automatically redistributing data as necessary. Reads scale because the only queries supported are those whose performance scales with the size of the result set rather than the data set, so a query whose result set contains a hundred entities performs the same whether it searches over a hundred entities or a million.

For this reason, some types of queries are not supported. All queries are served by previously built indexes, so the types of queries that can be executed are more restrictive than those allowed on an RDBMS with SQL.
Cloud Datastore does not support join operations, inequality filtering on multiple properties, or filtering on data based on the results of a subquery.

It doesn’t require entities of the same kind to have a consistent property set.

Design considerations

Always use UTF-8 characters, and avoid forward slashes (/) in kind names and custom key names.
Avoid storing sensitive information in a Cloud Project ID, as it might be retained beyond the life of your project.

Avoid high read or write rates to keys that are lexicographically close.

You should use sharding or replication for hot keys.

Ramp up traffic to new kinds gradually in order to give Bigtable time to split tablets as the traffic grows.

Avoid deleting large numbers of entities across a small range of keys. Bigtable periodically rewrites its tables to remove deleted entries and to reorganize your data so that reads and writes are more efficient; this process is known as compaction.

When you delete a large number of entities that are stored closely together, for example entities with an index on a timestamp field, queries over that part of the index will be slower until compaction has completed.

Transaction throughput is limited to 1 write per second per entity group.
Split frequently updated entities across multiple kinds.

Sharding and replication

Replication can be used to read a portion of the key range at a higher rate.
Sharding can be used to write to a portion of the key range at a higher rate; it splits a single entity into many.

By design, Datastore does not handle high write rates to a single entity group; the recommendation is to shard entities. If you update an entity group too rapidly, your writes will have higher latency and you will see other types of errors, known as contention.

Cloud Datastore is built on top of Google’s NoSQL database Bigtable, which scales by sharding rows, ordered by key, into separate tablets.

Datastore and Bigtable shard automatically, but if your app requires more writes than the limits allow, you can also shard manually.

In order to avoid contention under high write rates, you should shard counters. To update a counter faster than the recommended five times a second, split it into shards.
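
The sharding idea can be sketched in pure Python (no Datastore calls; in a real app each shard would be a separate entity and each increment would be a transaction):

```python
import random

NUM_SHARDS = 10
shards = {}  # stands in for one Datastore entity per shard

def increment(counter_name: str) -> None:
    """Pick a random shard and increment it, spreading writes
    across NUM_SHARDS entities instead of hotspotting one."""
    key = f"{counter_name}-shard-{random.randrange(NUM_SHARDS)}"
    shards[key] = shards.get(key, 0) + 1

def total(counter_name: str) -> int:
    """Read all shards and sum them to get the counter value."""
    prefix = f"{counter_name}-shard-"
    return sum(v for k, v in shards.items() if k.startswith(prefix))

for _ in range(100):
    increment("page-views")
print(total("page-views"))  # 100
```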

Use replication to read a portion of the key range at a higher rate.
If your application is bound by read performance, replication may be a better option than sharding.

You can use replication if you need to read a portion of the key range at a higher rate than Bigtable permits. Using this strategy you would store N copies of the same entity, allowing an N times higher rate of reads than is supported by a single entity.
A standard use case for this is a static config file, which would get loaded per request. In this case your app would have a static number of config objects.
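
That replication pattern can be sketched in pure Python (the dict stands in for Datastore entities; N_COPIES and the key naming are assumptions for illustration):

```python
import random

N_COPIES = 5
store = {}  # stands in for Datastore entities

def write_replicated(name: str, value) -> None:
    """Store N identical copies under distinct keys so reads can fan out."""
    for i in range(N_COPIES):
        store[f"{name}-copy-{i}"] = value

def read_replicated(name: str):
    """Read one copy at random, spreading read load across all copies."""
    return store[f"{name}-copy-{random.randrange(N_COPIES)}"]

write_replicated("config", {"feature_x": True})
print(read_replicated("config"))  # {'feature_x': True}
```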

Replication, Query types, transactions and handling errors

Remember to use batch operations for reads, writes, and deletes instead of individual operations. Batching allows you to perform multiple operations with the same overhead as a single operation.
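
A simple batching helper might look like this (pure Python; the batch size of 500 is an assumption here, so check the current per-call entity limits before relying on it):

```python
from typing import Iterable, List

def batches(items: List, batch_size: int = 500) -> Iterable[List]:
    """Yield successive slices so each batch call (e.g. a put_multi)
    stays under an assumed per-call entity limit of 500."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# e.g. for chunk in batches(entities): client.put_multi(chunk)
chunks = list(batches(list(range(1200))))
print([len(c) for c in chunks])  # [500, 500, 200]
```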

Where available, use asynchronous calls to minimize latency impact.

Use query types based on your needs

  • Keys-only: retrieves only the keys of matching entities; returns results at lower latency and cost
  • Projection: retrieves specific properties from an entity, limited to the properties included in the query filter; returns results at lower latency and cost
  • Ancestor: returns strongly consistent results; requires your data to be structured for strong consistency
  • Entity: retrieves entities of a kind, with zero or more filters and zero or more sort orders

Improve query latency by using cursors instead of offsets

Using an offset only avoids returning the skipped entities to your application, but these entities are still retrieved internally, so they still affect the latency of the query.

Design your keys with these considerations in mind

For a key that uses a numeric ID:

  • Do not use a negative number or 0.
  • To assign your own numeric IDs manually to the entities you create, have your application obtain a block of IDs with allocateIds(). This prevents Datastore from assigning one of your manual numeric IDs to another entity.
  • If you assign your own manual numeric ID or custom name, do not use monotonically increasing values.
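
One common way to avoid monotonically increasing key names is to prefix them with a hash, which scatters consecutive IDs across the key range (the helper below is a hypothetical sketch, not an official API):

```python
import hashlib

def scattered_name(sequential_id: int) -> str:
    """Prefix a sequential ID with a short hash so consecutive IDs
    land far apart in the key range instead of hotspotting one tablet."""
    prefix = hashlib.md5(str(sequential_id).encode()).hexdigest()[:8]
    return f"{prefix}-{sequential_id}"

print(scattered_name(1001))
print(scattered_name(1002))  # adjacent ID, but a different hash prefix spreads the writes
```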

Transactions are a set of Datastore operations on one or more entities; they are guaranteed to be atomic. The maximum transaction time is 60 seconds, but if a transaction lasts for more than 30 seconds, it will be terminated after 10 seconds of inactivity.

Transactions may fail when too many concurrent modifications are attempted on the same entity group, when the transaction exceeds a resource limit, when Datastore encounters an internal error, or when the operations in a transaction touch more than 25 entity groups.

If your app receives an exception when committing a transaction, it does not always mean that the transaction failed. Whenever possible, make your transactions idempotent: an idempotent operation has no additional effect if it is called more than once with the same input parameters, so repeating the transaction produces the same end result.
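
The idempotency point can be illustrated with two tiny operations (pure Python; an entity is modeled as a dict):

```python
def mark_done(task: dict) -> None:
    """Idempotent: applying it twice leaves the same end state."""
    task["done"] = True

def add_view(task: dict) -> None:
    """Not idempotent: a retried commit would double-count."""
    task["views"] = task.get("views", 0) + 1

task = {"name": "write docs"}
mark_done(task)
mark_done(task)   # safe to repeat
print(task["done"])  # True

add_view(task)
add_view(task)    # a blind retry here changes the result
print(task["views"])  # 2
```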

Design your app to handle errors

If a transaction fails, try to roll it back. Having a rollback in place minimizes retry latency for concurrent requests contending for the same resources in a transaction.

When a request succeeds, the API returns an HTTP 200 OK and the requested data in the body of the response. Failures return an HTTP 400 or 500 error.

Error              Recommended action
ALREADY_EXISTS     Do not retry without fixing the problem
DEADLINE_EXCEEDED  Retry using exponential backoff
INTERNAL           Do not retry this request more than once
ABORTED            For requests that are part of a transactional commit, retry the entire transaction or structure your entities to reduce contention
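
A minimal sketch of the exponential backoff recommended for DEADLINE_EXCEEDED (the delay values and the use of TimeoutError as a stand-in for the Datastore error are illustrative assumptions):

```python
import time

def retry_with_backoff(operation, max_attempts: int = 5, base_delay: float = 0.1):
    """Retry a callable that may fail transiently, doubling the wait each time."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:  # stands in for a DEADLINE_EXCEEDED error
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Simulated flaky operation: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("deadline exceeded")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # ok
```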