[Software Architecture: The Hard Parts][Chapter 6] Pulling Apart Operational Data Part 1: Data Decomposition Drivers
Introduction
Hello everyone! This article continues from the previous one (Component-based decomposition patterns). Once the application has been broken into several coarse-grained services, attention turns to the data, and we have to ask whether the data needs to be split as well or whether the services can keep using a monolithic database.
Data is generally the most important asset a company has, so breaking apart or restructuring data carries a greater risk of business and application disruption.
Interestingly enough, some of the techniques used to break apart application functionality can be applied to breaking apart data as well. Components translate to data domains, class files translate to database tables, and coupling points between classes translate to database artifacts such as foreign keys, views, triggers or even stored procedures.
For every decomposition decision, there exist some drivers. These drivers 'drive' our decision.
Data Decomposition Drivers
Understanding and analyzing data disintegrators (drivers that justify breaking apart the data) and data integrators (drivers that justify keeping the data as is) is a very important process. Let's start with the disintegrators.
Data Disintegrators
These disintegrators provide answers and justification for the question "When should I break apart my data?". There are six main disintegration drivers:
Change Control
How many services are impacted by a single table change (dropping tables, removing columns, etc.)?
Connection Management
Can the database handle the connections needed from multiple distributed services?
Scalability
Can the database scale to meet the needs of services accessing it?
Fault Tolerance
How many services are impacted by a database crash?
Architectural Quantum
Is a single database forcing me into an unwanted single architecture quantum?
Database Type Optimization
Can I optimize my data by using multiple database types?
Change Control
Dropping tables or columns, renaming tables or columns, or even changing a column's type might break the services that use those tables. These are usually called breaking changes, as opposed to changes such as adding a table, which don't cause a problem for existing services.
If several services use the same table or column, all of them have to be updated, tested and redeployed together. This coordination quickly becomes difficult and error-prone as the number of deployed services increases. Imagine trying to coordinate 42 separately deployed services for a single breaking database change!
The real danger is forgetting about a service that uses the changed table, because it will be nonoperational in production until it can be fixed.
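To make the change control question concrete, here is a minimal sketch of an impact check. The service names, table names and the `SERVICE_TABLES` inventory are all hypothetical; in a real system the inventory would have to come from code analysis or documentation, and keeping it accurate is the hard part.

```python
# Hypothetical inventory: which tables each service reads or writes.
SERVICE_TABLES = {
    "ticket-service": {"ticket", "customer"},
    "billing-service": {"invoice", "customer"},
    "survey-service": {"survey"},
}


def impacted_services(changed_table, service_tables=SERVICE_TABLES):
    """Return the sorted names of services that use the changed table."""
    return sorted(
        name
        for name, tables in service_tables.items()
        if changed_table in tables
    )


if __name__ == "__main__":
    # Every service listed here must be updated, tested and redeployed
    # together with a breaking change to the "customer" table.
    print(impacted_services("customer"))
```

With this made-up inventory, a breaking change to the customer table touches two of the three services, while a change to the survey table touches only one.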
Breaking the database into well-defined bounded contexts significantly helps control breaking database changes. The bounded context concept comes from Eric Evans's book Domain-Driven Design and describes the source code, business logic, data structures and data all bound together (encapsulated) in a specific context.
Well-formed bounded contexts around services and their corresponding data help control change because a change only affects the services within that specific context.
Most typically, bounded contexts are formed around services and the data they "own". By "own" we mean that these are the services that write to the database.
The most important rule of a bounded context is this: if a service in one context needs data from another context, it has to request that data from the service responsible for it, not read it from the database directly. Direct database access would break the bounded context. Going through the owning service also abstracts the database behind the contract between the two services.
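The rule above can be sketched in a few lines. This is only an illustration: `CustomerService` and `TicketService` are hypothetical names, and a plain in-process method call stands in for what would be a remote call between separately deployed services.

```python
import sqlite3


class CustomerService:
    """Owns the customer data; the only code that touches its database."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")  # private to this context
        self._db.execute(
            "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)"
        )
        self._db.execute("INSERT INTO customer (id, name) VALUES (1, 'Ada')")
        self._db.commit()

    def get_customer(self, customer_id):
        """The contract other contexts depend on: a dict, not a table row."""
        row = self._db.execute(
            "SELECT id, name FROM customer WHERE id = ?", (customer_id,)
        ).fetchone()
        return None if row is None else {"id": row[0], "name": row[1]}


class TicketService:
    """Lives in another context and only knows the CustomerService contract."""

    def __init__(self, customer_service):
        self._customers = customer_service  # no database handle is shared

    def describe_ticket(self, ticket_id, customer_id):
        customer = self._customers.get_customer(customer_id)
        name = customer["name"] if customer else "unknown customer"
        return f"Ticket {ticket_id} for {name}"


if __name__ == "__main__":
    tickets = TicketService(CustomerService())
    print(tickets.describe_ticket(7, 1))
```

Because `TicketService` never sees the customer table, the customer context is free to rename columns or change storage as long as `get_customer` keeps returning the same shape.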
Connection Management
Establishing a connection to a database is an expensive operation. A database connection pool is used to increase performance and to limit the number of concurrent connections an application is allowed to use.
In distributed services, each service (and each instance of a service) has its own connection pool. When multiple services share the same database, the available database connections can quickly become saturated, particularly as the number of services or service instances increases.
Reaching or exceeding the maximum number of allowed database connections is an important driver when deciding whether to break apart the data.
Frequent connection waits (a request waiting for a connection to become available) are usually one of the first signs that the maximum number of connections has been reached. They can also show up as request time-outs.
One viable solution is to assign every service, or group of services, a connection quota that specifies the maximum number of connections it is allowed to use.
Quotas are usually assigned after monitoring the needs and scalability of each service. Not all services should get the same quota; it depends entirely on the needs of each service.
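As a rough illustration of how connections add up, the sketch below computes the worst-case number of connections for a few services and checks them against per-service quotas. All numbers, names and the `MAX_DB_CONNECTIONS` limit are invented for the example, and it assumes every instance can fill its whole pool at the same time.

```python
# Hypothetical deployment: instances per service and pool size per instance.
SERVICES = {
    "ticket-service": {"instances": 4, "pool_size": 10},
    "billing-service": {"instances": 2, "pool_size": 10},
    "survey-service": {"instances": 1, "pool_size": 5},
}
# Hypothetical quotas and database limit.
QUOTAS = {"ticket-service": 30, "billing-service": 25, "survey-service": 10}
MAX_DB_CONNECTIONS = 60


def worst_case(service):
    """Connections a service can hold if every instance fills its pool."""
    return service["instances"] * service["pool_size"]


def total_connections(services):
    """Worst-case connections across all services sharing the database."""
    return sum(worst_case(s) for s in services.values())


def over_quota(services, quotas):
    """Sorted names of services whose worst case exceeds their quota."""
    return sorted(
        name for name, s in services.items() if worst_case(s) > quotas[name]
    )


if __name__ == "__main__":
    print(total_connections(SERVICES), "of", MAX_DB_CONNECTIONS)
    print(over_quota(SERVICES, QUOTAS))
```

In this example the three services could together ask for 65 connections against a limit of 60, and only the ticket service exceeds its own quota, which makes it the first candidate for a smaller pool or its own database.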
Scalability
One of the biggest advantages of distributed systems is scalability: the ability of a system to handle an increase in request volume while maintaining the same response time. Scaling services, however, can put a tremendous strain on a shared database.
For a distributed system to scale, all parts of the system must scale, including the database.
Database connections, capacity, throughput and performance are all factors in determining whether a shared database can meet the scalability demands of multiple services.
Breaking the database apart into the data domains mentioned earlier puts less load on each database, which increases scalability and performance overall.
Fault Tolerance
When multiple services use the same database, the system becomes less fault-tolerant because the database is a single point of failure.
Fault tolerance is the ability of the system to continue operating when a fault occurs (for example, when a service or a database fails).
If fault tolerance is a key requirement for the system, breaking apart the data can help achieve it.
Architectural Quantum
An architectural quantum is an independently deployable artifact with high functional cohesion, high static coupling and synchronous dynamic coupling.
A system in which every service shares a single database will always be a single architectural quantum, because the database is included in the functional cohesion part of the quantum definition.
Breaking apart the data can give each domain its own quantum, so the whole system doesn't have to be one architectural quantum.
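The effect of a shared database on quanta can be sketched with a deliberately simplified model. The function below assumes each service depends on exactly one database and treats the shared database as the only coupling that matters; the real quantum definition also involves deployment and dynamic coupling, which this ignores. Service and database names are hypothetical.

```python
def quanta(service_db):
    """Group services by the database they depend on.

    Simplification: services that share a database are placed in the same
    quantum; services with different databases are placed in separate ones.
    """
    groups = {}
    for service, database in service_db.items():
        groups.setdefault(database, []).append(service)
    return sorted(sorted(group) for group in groups.values())


# Every service uses the monolithic database.
MONOLITHIC = {"ticket": "main_db", "billing": "main_db", "survey": "main_db"}

# Billing data has been pulled out into its own database.
SPLIT = {"ticket": "ticket_db", "survey": "ticket_db", "billing": "billing_db"}

if __name__ == "__main__":
    print(quanta(MONOLITHIC))
    print(quanta(SPLIT))
```

Under this model the monolithic layout collapses into one group of three services, while the split layout yields two groups, so billing can be deployed and scaled on its own.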
Database Type Optimization
Not all data is treated the same. With a monolithic database, all data must adhere to the same database type and its performance characteristics, which can produce sub-optimal solutions for certain types of data.
Breaking apart monolithic data allows the architect to move certain data to a more suitable database type. For example, key-value records that reside in a monolithic relational database could be moved to a key-value database that is optimized for that kind of data.
Data Integrators
Integrators do the exact opposite of disintegrators. They provide answers and justification for the question "When should I consider putting data back together?"
The integration drivers are as follows:
Data Relationships
Are there foreign keys, triggers, and views that form close relationships between the tables?
Database Transactions
Is a single transactional unit of work necessary to ensure data integrity and consistency?
Data Relationships
Like components in an architecture, database tables can be coupled as well. Foreign keys, triggers, views and stored procedures tie tables together, making it difficult to pull the data apart. However, tables that stay in the same bounded context can keep these database artifacts.
The relationship between data, whether logical or physical, is an integration driver. It creates trade-offs between integrators and disintegrators. For example, is fault tolerance more important than preserving foreign keys and relationships? In architecture, analyzing these trade-offs against the requirements will give you the answer you need.
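One way to see how much a proposed split would cost in relationships is to list the foreign keys that would cross a domain boundary. The sketch below is an assumption-laden illustration: the table names, the `TABLE_DOMAIN` assignment and the `FOREIGN_KEYS` list are all made up, and a real analysis would also have to look at views, triggers and stored procedures.

```python
# Hypothetical assignment of tables to data domains.
TABLE_DOMAIN = {
    "customer": "customer",
    "payment": "payment",
    "invoice": "payment",
    "ticket": "ticketing",
}

# Hypothetical foreign keys as (child_table, parent_table) pairs.
FOREIGN_KEYS = [
    ("invoice", "payment"),
    ("payment", "customer"),
    ("ticket", "customer"),
]


def cross_domain_keys(foreign_keys, table_domain):
    """Foreign keys whose two tables would end up in different domains."""
    return [
        (child, parent)
        for child, parent in foreign_keys
        if table_domain[child] != table_domain[parent]
    ]


if __name__ == "__main__":
    # These keys would have to be dropped if the domains were separated.
    for child, parent in cross_domain_keys(FOREIGN_KEYS, TABLE_DOMAIN):
        print(f"{child} -> {parent}")
```

Here the invoice-to-payment key survives because both tables stay in the payment domain, while the two keys pointing at the customer table would be lost.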
Database Transactions
One of the biggest advantages of having a single database is database transactions. Services that need to update multiple entities in an ACID manner (atomicity, consistency, isolation and durability) can do so easily within a single database. Once the data is split into separate schemas or databases, that single transactional unit of work no longer exists, because the updates now happen through remote calls between services. An insert can succeed in one table while failing in another because of an error condition, resulting in data inconsistency and integrity issues.
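The difference can be demonstrated with Python's built-in `sqlite3` module. This is a minimal sketch, not a model of a real distributed system: the `orders` and `payments` tables are hypothetical, two in-memory databases stand in for two services' databases, and a `NOT NULL` violation stands in for any error condition on the second write.

```python
import sqlite3


def new_db(*tables):
    """Create an in-memory database with the given (hypothetical) tables."""
    conn = sqlite3.connect(":memory:")
    for table in tables:
        conn.execute(
            f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, note TEXT NOT NULL)"
        )
    conn.commit()
    return conn


def count(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def single_database():
    """Both inserts share one transaction, so the failure undoes both."""
    db = new_db("orders", "payments")
    try:
        with db:  # commits on success, rolls back on an exception
            db.execute("INSERT INTO orders (note) VALUES ('order 1')")
            db.execute("INSERT INTO payments (note) VALUES (NULL)")  # fails
    except sqlite3.IntegrityError:
        pass
    return count(db, "orders"), count(db, "payments")


def split_databases():
    """Each insert commits on its own; the first one survives the failure."""
    orders_db, payments_db = new_db("orders"), new_db("payments")
    try:
        with orders_db:
            orders_db.execute("INSERT INTO orders (note) VALUES ('order 1')")
        with payments_db:
            payments_db.execute("INSERT INTO payments (note) VALUES (NULL)")
    except sqlite3.IntegrityError:
        pass
    return count(orders_db, "orders"), count(payments_db, "payments")


if __name__ == "__main__":
    print("single database:", single_database())
    print("split databases:", split_databases())
```

With a single database the failed payment insert rolls back the order as well, leaving both tables empty. With split databases the order is already committed when the payment fails, so the system is left with an order that has no payment.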
Summary
Integrators and disintegrators are a great way of analyzing trade-offs as an architect. They help answer questions about the data, which in turn helps in making the right decision. In the next article, we'll be talking about pulling the monolithic data apart. Thanks, and until the next one.