How to model user data using SQL and NoSQL databases

There are many debates about which type of database is better. I will focus on how to use them for a specific case and how to evolve the design as needed.

SQL vs NoSQL
The case to consider
1) The basic scenario
2) Adding flexibility per customer
3) Total flexibility using a SQL database
4) Using a NoSQL database
Conclusion

SQL vs NoSQL

There are many types of databases, and each of them fits better for a different scenario. For example, PostgreSQL works well for the normal usage of a website as it is quite balanced, Redshift works great as a data warehouse for reporting as it offers very fast read access and a structured data format, and MongoDB can support an application that handles thousands of orders per second as it is quite fast and scales via sharding.

Organisations used to choose only one type, but the appearance of microservices has allowed them to use a combination of them, and in some cases more than one for the same data, as happens with CQRS. The only limitations are now the learning curve and the people needed to maintain it. The maintenance effort can be nearly zero with databases as a service like DynamoDB, but can be quite high with others like Cassandra, so it is an important factor to consider.

The case to consider

I am going to focus on a very specific case, which is how to model the data of a user registration form. I will start with a basic scenario that fits well in a structured table and will progress until the point where it is better to migrate it to a more flexible format.

I will mention basic types like string, but all the fields should be stored encrypted. They contain personal information that would be leaked if a hacker got access to the database and the data were in plain text.

1) The basic scenario

Let’s assume that we have a basic screen with a fixed set of fields that are the same for each user. For example, Name, Age, and Country. It is obvious that Name would be a string and Age would be a number. The country could be modelled as a string, but it would be problematic because each person may write it in a different way (“Spain”, “spain”, “España”, …), and we would have multiple occurrences of the same. It would also be difficult to filter by it in a user listing.

Due to this, it would be better to have a separate table with a list of countries and an identifier pointing to it. This would also allow us to have a dropdown field in the form to select the country from a list instead of having to type its name.

This could be stored perfectly fine in MySQL, and it would be very easy to run reporting queries on it. It would also be quite easy to detect if the country ID doesn’t exist using a foreign key. We could also add some restrictions on the database level, like the maximum name length of the name or making the age optional or mandatory, although that should also be validated in the back-end code before doing the queries.

2) Adding flexibility per customer

Let’s say that we have a SaaS application, and we onboard a couple of customers who have slightly different needs for storing the data of their employees. For example, customer A may need to track which office the employee will work in, and customer B may want to store the employee’s manager.

It is just a couple of fields that don’t require big changes, and other customers may also want to use them in the future. We could just add a couple of optional columns to the table. In case the columns are mandatory per customer, we could add a validation in the application code. We could also add a constraint in the schema based on the customer, but that would affect the performance for all the customers.

3) Total flexibility using a SQL database

Let’s suppose that our SaaS application is quite successful, and we get more big customers. One of them may want to store “Phone number”, “Email address”, “Family size”, “Requires visa” … while others may need “Cost centre”, “Payroll code”, … The structured schema that we had before wouldn’t work anymore, as we would quickly reach the maximum number of columns per table that the database supports. The table would also be unnecessarily big, and most of the cells would be empty.

We could choose a couple of approaches, assuming that we want to keep using the same database:

3.1) Have a different table per customer: If only two big customers have these requirements and the others are happy to use the standard columns, we could have the common ones in one table and the different ones in separate tables.

The problem with this is that it doesn’t scale. It would make the back-end logic quite complicated, as we would need different queries per customer. Imagine with hundreds of customers…

3.2) Have a flexible structure using key/value pairs: If each customer needs a completely different set of columns, we could have the common fields in a separate table and then use a shared one. The specific ones could be in a key-value pair, in which each row would represent a field of each user. Each row may store a string, a number, a boolean value, … so we have to consider it.

This would give us the flexibility needed and would scale and support any number of customers. It would also allow us to keep using the same database, which is perfect if we are using a monolith or there are some related tables that are used by the same service and cannot be modified.

However, it is not a very desirable solution as we would need dozens of rows per employee. It would also complicate the application code as we would have to retrieve all of them and combine the data in the back end.

4) Using a NoSQL database

These databases offer the flexibility needed as they don’t have a rigid schema. We could have a different structure per customer, what would be like this:

User example for customer A:

{
    id: 1,
    name: Claire,
    country: 1,
    customer: A,
    phone number: “077070707”,
    email address: “claire@email.com”,
    family size: 3,
    requires visa: yes,
    ...
}

User example for customer B:
{
    id: 2,
    name: Paul,
    country: 2,
    customer: B,
    cost centre: “London Office”,
    payroll code: “PAY-123″,
    ...
}

This solution would be quite simple to implement in the backend, as the queries would just store and retrieve users without caring too much about the data structure. The challenge of this approach is to do reporting or data analysis queries. A way of doing it would be copying the data to a more structured format, for example, in the ETL layer.

Conclusion

We have seen different ways of structuring the data depending on the evolving needs of the application. There isn’t a solution that fits perfectly for the most complex case, so it is up to the team to discuss the trade-offs and choose it based on the architecture, the flexibility, and the cost of combining different technologies

No comments to show.

Rafael Borrego's blog