Azure CosmosDB – the new dog in the Cloud. Surely our old friend SQL is good enough?

For years, as developers we have chosen by default some form of SQL or relational database management system – RDBMS to store application data.

Written by Nigel Wardle

Cloud Modernisation Practice Manager

Azure CosmosDB

For years, as developers we have chosen by default some form of SQL or relational database management system – RDBMS to store application data. Probably because that is all there was at the time. But now we are entering the age of the cloud and alternatives are beginning to appear. so how does RDBMS fair?

Reasonably well – If we consider SQL Azure (which is a cloud version of Microsoft SQL Server). It supports encryption both in transit and at rest. It can automate much that a DBA would do, such as – table indexing, backup and replication to other instances. We still have to distribute (or normalise) our data across related tables – so nothing has changed there. If we are using C#, then we can configure our code to automatically generate the database table structure (schema) using Entity Framework in ‘code first’ mode.

So what’s wrong with RDBMS? The simple fact is that it has serious trouble in ‘scaling out’ or ‘horizontally scaling’ for truly global data storage – meaning we can’t simply add a new second server (without any code changes) when needing more writable resources. It can only ‘scale up’ by configuring the cloud service to provide more resources – allocating more processing power or memory to the writable server. The bottom line is that for most applications there is one writable database hosted on one server.

Note: To be fair, there are a few workarounds. One way is having identical writable databases on separate servers and then a choice of processes to keep them in sync. Another way is implementing a technique known as ‘sharding’. This is where there are again multiple databases but the client application filters (or partitions) the incoming data using a particular column or field and hard codes the distribution of records across the databases. Both methods often add complexity or custom logic to the client applications that support them.

So what is CosmosDB, what is different about it, and under what circumstances can we consider it? CosmosDB is Microsoft’s newer cloud native database. It comes in different connectivity. In this article we will discuss its SQL API. Note: It also has other supporting APIs: MongoDB, Cassandra, Table & Graph.

At the heart it hosts JavaScript Object Notation (JSON) text based documents. These documents store our data in a structure that we define (no fixed schema) and so in theory every document could be different. Each document has a maximum size of 2MB. Once setup, our instance has a hierarchical structure in the form: <my-account>.documents.azure.com/<my-database>/<container>/<item>. If we choose the SQL API form then container will be a document collection and item will be a JSON document.

If we now have a data record of a single person to store – instead of distributing across many relational tables, the complete record can be stored in a single JSON document. When we query for records, it simply returns the contents of the JSON documents. Compare this to RDBMS which first has to work out how to join up the data from the separate tables before it can consider returning anything. Hence:

CosmosDB can have extremely low latency, meaning we can get the results at lightning speeds. By the way – we don’t have to create indexes, as this is done for us!
The JSON documents can be copied or replicated to other CosmosDB instances. Each copy can be further edited and replicated back. Hence no particular document is the master. All databases can be writable and we can have many globally replicating databases. Hence we can ‘scale up’ and ‘scale out’.
Additionally we can configure a document collection with how to partition the data across separate servers. We do this by specifying
A non changing field in our documents as a partition key. The bottom line is that as documents are added to the collection they are distributed behind the scenes to separate servers which can then be queried in parallel.
There are five levels of consistency: STRONG, BOUNDED STALENESS, SESSION, CONSISTENT, PREFIX, EVENTUAL. In simple terms we might favour STRONG for an airline ticketing application and EVENTUAL to record ‘likes’ in a social media app. But we wont go into this today!

We can execute SQL like queries, or use code, to save documents, return one or more documents, or even return portions (projections) of documents – and through the SDK easily convert to and from JSON to in-memory objects. But what if we require information that is scattered across many different documents? In this scenario the client application may have to download each document separately and picking out what is needed (joins between documents are not yet supported). So why not avoid this by duplicating this data into another single document instead?

We have become so fixated in normalising into separate tables that the idea of duplicating can make us feel very uncomfortable. And we have done this without a care as to how applications might read or write data. Of course we would make certain that we created sufficient table indexes to avoid full table scans!

In CosmosDB we should rethink – planning our JSON documents so that they are optimised for both read and write tasks we want our application to perform.

Let’s look at an example of how we can typically store data relating to a persons contact details in a 4 table relational database.

We can obtain a person record by executing a single query of the database but incurring an expensive join between the 4 tables.

SELECT p.FirstName, p.LastName, a.City, cd.Detail

FROM Person p

JOIN ContactDetail cd ON cd.PersonId = p.Id

JOIN ContactDetailType on cdt ON cdt.Id = cd.TypeId

JOIN Address a ON a.PersonId = p.Id

If we then wanted to update a field in each table then we would have to submit an individual query update for each table. Hence 5 server round trips

So how does it compare with storing a single person record in CosmosDB.

Here we embed the complete record in one document which we can obtain from a single query. We can then update anything in the record and save it back in a single operation. Job done!

{

“id”: “1”,

“firstName”: “Thomas”,

“lastName”: “Andersen”,

“addresses”: [

{

“line1”: “100 Some Street”,

“line2”: “Unit 1”,

“city”: “Seattle”,

“state”: “WA”,

“zip”: 98012

}

“contactDetails”: [

{“email: “thomas@andersen.com”},

{“phone”: “+1 555 555-5555”, “extension”: 5555}

]

}

The big assumption is that the person has a limited or bounded number of addresses or contact details. If however, we needed to additionally store the date and time for each login, then the corresponding embedded list and would grow over time without bounds, possibly growing into Megabytes in size. Every time we wanted to edit a single field we would have to download this large document, find the field, edit it and then post it back to save it. Hence for performance and code complexity reasons it would be wise move login details out of this document and into one or more other documents that can be referenced.

Let’s look at an example of how we could store a many to many unbounded relationship in SQL and also in CosmosDB.

Author Documents:

{“id”: “a1”, “name”: “Thomas Andersen”, “books”: [“b1, “b2”, “b3”]}

{“id”: “a2”, “name”: “William Wakefield”, “books”: [“b1”, “b4”]}

Book Documents:

{“id”: “b1”, “name”: “Azure Cosmos DB 101”, “authors”: [“a1”, “a2”]}

{“id”: “b2”, “name”: “Azure Cosmos DB for RDBMS Users”, “authors”: [“a1”]}

{“id”: “b3”, “name”: “Learn about Azure Cosmos DB”, “authors”: [“a1”]}

{“id”: “b4”, “name”: “Deep Dive in to Azure Cosmos DB”, “authors”: [“a2”]}

Here we have six individual documents that have embedded items that reference each other by using an id field. This can work well for a small number of items but might not be optimal if the unbounded number of books or authors increased indefinitely.

Note: references between documents are not checked or enforced. Compare that to enforced Primary Key – Foreign Key relationships in relational databases that maintain database integrity.

In summary, if we only have bounded relationships, then for many reasons CosmosDB can seriously challenge RDBMS. For unbounded relationships, this can take a bit more thought but can still work very well.

For further reading please click here for the excellent article on which this blog is based.

Azure CosmosDB – the new dog in the Cloud. Surely our old friend SQL is good enough?

Written by Nigel Wardle

Azure CosmosDB

In CosmosDB we should rethink – planning our JSON documents so that they are optimised for both read and write tasks we want our application to perform.

Outcomes

Quick Links

Capabilities

Azure CosmosDB – the new dog in the Cloud. Surely our old friend SQL is good enough?

Written by Nigel Wardle

Azure CosmosDB

In CosmosDB we should rethink – planning our JSON documents so that they are optimised for both read and write tasks we want our application to perform.

SUBSCRIBE TO OUR BLOG FOR WEEKLY UPDATES