Data-centric project requirements?
Several times last week, in different circumstances, I was asked a question containing these three words or their synonyms. That’s not new. It happened previously. But this concentration triggered the write-up that follows. Nothing original and neither is the reason to write it:
Everything that needs to be said has already been said. But since no one was listening, everything must be said again.
— André Gide
Let’s first clarify what is data-centric and then see why it doesn’t go well with project and even less so with requirements.
What is data-centric?
The short answer is in these three¹ principles:
- Data is self-describing and does not rely on an application for interpretation and meaning.
- Data is expressed in open, non-proprietary formats.
- Applications are allowed to visit the data, perform their magic and express the results of their process back into the data layer.
I find these three are the most important data-centric principles, but another reason for selecting them is that they are context-independent. The data-centric manifesto they are taken from currently has an enterprise-only focus². Yes, the problem they address is most severely felt (or rather not felt, because the symptoms are ignored or misattributed) in large organizations. Yet, what is behind these principles is equally important for personal information management and on the open web. Let’s go quickly through all three levels then, from big to small, and see what data-centric means for the world wide web, for corporate IT, and then for personal information management.
The web was designed to be a decentralized system where the agreement on a few standards, basically HTTP and HTML, enabled free choice on just about anything else. People were finally free to express themselves and to choose from where and how to get information. They were free to innovate by building new browsers, websites, and whatever web applications and services they could think of. A system like this, with a self-maintained organization, can work well and have a natural tendency for virtuous cycles. In other words, it can amplify goodness and develop its own immune system for whatever threatens its viability. All it needs is to have the right kind of enabling constraints, for example, the standards I mentioned above, and to allow autonomy of all subsystems. This is the balance between autonomy and cohesion. It works for animals, people, tribes, organizations, society, and a socio-technical system like the web.
So the web flourished as a decentralized system, where people were free to choose and create more choices. And then one day the platforms appeared. They offered good and free services. Or at least they looked good and free at first. In reality, they were (and are) neither good nor free. The platforms are not nearly as good information providers as the decentralized web before them was. What we see is not what we are looking for, but what their algorithms decide to show us. And the services of these platforms are not free. Quite the contrary. We pay with our data, and we pay twice. Once by being their content providers and a second time by giving them our personal data. Importantly, we don’t give them only our current personal data but also our future data, by allowing them to track our online behaviour. Who are they? I’m talking of course about IT giants like Google, but the best example of extreme centralization and lock-in is Facebook³. In this way, the web, a decentralized system, shaped by the users, turned into a hyper-centralized system, shaped by a few powerful corporations⁴. It also formed users’ expectations. In 2019 Facebook and Google announced that it was now possible to copy images from Facebook to Google Photos. That’s the new norm for innovation. Only a few people noted the absurdity. As Ruben Verborgh pointed out, 50 years after being able to send video signals over a distance of 380,000km, we celebrate that we can finally move a photo by 11km (the distance between Facebook and Google headquarters). A bit dystopian, isn’t it?
Yet, the problems with this centralization are not widely understood. For example, very few people realize how platform-based political propaganda works, and that’s why it works so well. Even fewer relate it to the hyper-centralization of the web. Same with fake news and so on. Maybe the least understood of the damages is how it suffocates innovation. It’s easy to illustrate. Even when you use Google for product search, where it should excel after so many years of work, huge investments, massive feedback, and the use of language models with a trillion parameters, it’s really lame. Try searching for a bike below a certain price and a certain weight. You’ll get results for bikes above those limits, but okay, you can fix the price part using the shopping filter. Currently, the filter will not allow you to specify the weight, even though it’s available in most technical specifications published online. But even if they add it at some point, the final selection will still exclude the majority of the offerings by smaller companies. As a result, you can’t get an answer to this simple question.
A way out is to decouple data and applications.
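To make the bike example concrete, here is a minimal sketch in Python of what the question could look like once product data is decoupled from any single platform. It assumes, hypothetically, that sellers publish their offers as self-describing records in an open format. The field names loosely follow schema.org vocabulary, and the sellers, products, prices and weights are all invented for illustration. Under that assumption, any application can answer the question with a plain filter.

```python
import json

# Hypothetical offers, as sellers might publish them in an open format.
# Every name, price and weight here is invented for illustration.
PUBLISHED_OFFERS = json.loads("""
[
  {"@type": "Product", "name": "Gravel One", "seller": "Small Bike Co",
   "price": {"value": 950, "currency": "EUR"},
   "weight": {"value": 9.8, "unit": "kg"}},
  {"@type": "Product", "name": "City Cruiser", "seller": "Small Bike Co",
   "price": {"value": 600, "currency": "EUR"},
   "weight": {"value": 14.5, "unit": "kg"}},
  {"@type": "Product", "name": "Carbon Pro", "seller": "Big Retailer",
   "price": {"value": 2400, "currency": "EUR"},
   "weight": {"value": 7.9, "unit": "kg"}},
  {"@type": "Product", "name": "Commuter Lite", "seller": "Big Retailer",
   "price": {"value": 890, "currency": "EUR"},
   "weight": {"value": 9.9, "unit": "kg"}}
]
""")


def find_bikes(offers, max_price_eur, max_weight_kg):
    """Return the names of offers strictly below both limits, sorted by name."""
    matches = []
    for offer in offers:
        price = offer.get("price", {})
        weight = offer.get("weight", {})
        # Currency and unit travel with the data, so the application can
        # check them instead of guessing what a bare number means.
        if price.get("currency") != "EUR" or weight.get("unit") != "kg":
            continue
        if price["value"] < max_price_eur and weight["value"] < max_weight_kg:
            matches.append(offer["name"])
    return sorted(matches)


print(find_bikes(PUBLISHED_OFFERS, max_price_eur=1000, max_weight_kg=10))
```

The filter treats offers from small and large sellers alike, because the selection logic lives in whatever application visits the data, not in the platform that hosts it.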
This was for the web. Now for enterprises.
In enterprises, for decades the applications were built in a way that the data model is separate for each application, trapped inside it, and the interpretation of the data is in the application code. The applications themselves are built based on historical functional requirements. When some change needs to be made, coming from new business needs or changed legislation, it takes months, costs huge amounts of money, and leads to increased complexity and technical debt. The same happens when two or more applications need to be integrated. This is the application-centric way of building applications. It is dominant to this day. Most big enterprises have thousands of application silos. They try to integrate the data through data warehouses, data lakes, point-to-point interfaces and APIs. All these methods provide partial and temporary solutions and add to the technical debt.
A way to solve this is to decouple data from applications.
A lot more can be said about what data-centric means at the enterprise level. If you have a bit of time to learn about it, I’d recommend watching this video. If you have more time, it’s worth going through The Data-Centric Revolution and if you have even more, read it after its predecessor Software Wasteland to get a better understanding of the size and the nature of the problem. If you have spent more than a couple of years in a big organization, you’ll find many familiar patterns.
We have this problem not only on the open web and in enterprises. We also have it with our personal information management. Our emails are trapped in one application, our documents in another, and then we keep our bookmarks disconnected from them inside the browser. We use one application to search on the web, and another to search our files. Now we can combine them, but only if we forfeit our freedom to choose our operating system and browser. If we look for something we communicated in writing, we have to remember where we wrote or read it. If we don’t, we have to search our files, our email, Twitter, Facebook, and the web. When we write a Word document we have to open it with MS Word. But when we are in Word we don’t have access to our tasks. For that, we need to go to another application.
A way to solve this is to decouple data from applications.
At all these scales, societal, organizational and personal, when it comes to managing information, we have similar kinds of problems coming from the tight application-data coupling or platform-data coupling. I will focus on organizations from now on to the end of this article, but it’s important to keep the bigger picture in mind.
What does a digital transformation from application to data-centric enterprise look like? In a perfect world it would look something like this:
EKG stands for Enterprise Knowledge Graph. It is something that complies with these design and governance principles.
A more realistic, but still ambitious transformation, will keep the data of the current applications where it is but will have it duplicated (virtualised or streamed) in the enterprise knowledge graph where it will be living an independent life, together with its semantics.
However, all new applications should be built in a way that they “visit the data, perform their magic and express the results of their process back into the data layer”.
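As a rough illustration of that last principle, the sketch below models the data layer as a set of subject-predicate-object statements and shows one small application that visits it and writes its result back as more statements. This is a toy under stated assumptions, not an enterprise knowledge graph: the `ex:` identifiers, the sample orders and the function names are all hypothetical, and a real implementation would more likely use a graph database and a standard such as RDF.

```python
# A toy "data layer": a set of (subject, predicate, object) statements.
# All identifiers such as "ex:order1" are hypothetical.
data_layer = {
    ("ex:alice", "rdf:type", "ex:Customer"),
    ("ex:order1", "rdf:type", "ex:Order"),
    ("ex:order1", "ex:customer", "ex:alice"),
    ("ex:order1", "ex:amount", 120.0),
    ("ex:order2", "rdf:type", "ex:Order"),
    ("ex:order2", "ex:customer", "ex:alice"),
    ("ex:order2", "ex:amount", 80.0),
}


def objects(data, subject, predicate):
    """All objects stated for a given subject and predicate."""
    return [o for s, p, o in data if s == subject and p == predicate]


def subjects(data, predicate, obj):
    """All subjects that have the given predicate pointing at obj."""
    return [s for s, p, o in data if p == predicate and o == obj]


def total_spend_app(data):
    """Visit the data and return new statements with each customer's total.

    The application keeps no private copy and no private schema. Its
    output is expressed in the same form as its input.
    """
    results = set()
    for customer in subjects(data, "rdf:type", "ex:Customer"):
        orders = subjects(data, "ex:customer", customer)
        total = sum(
            amount
            for order in orders
            for amount in objects(data, order, "ex:amount")
        )
        results.add((customer, "ex:totalSpend", total))
    return results


# The application expresses its results back into the data layer.
data_layer |= total_spend_app(data_layer)

print(objects(data_layer, "ex:alice", "ex:totalSpend"))
```

A second application could now read `ex:totalSpend` without knowing that the first one exists, which is the kind of loose coupling the principle aims for.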
While data-centric is all about decoupling applications and data, the term itself, as a goal, slogan and buzzword, is problematic. Data-centric has similar problems to the preceding waves of process-centric and service-centric movements. Even worse, the word itself can impede the transformation it promotes. When a data department in a big organization is promoting data-centricity inside the organization, that can be easily misconstrued as just being self-centric. If the idea is sold, there will be more budget for the data department. It’s like a planet faking strong gravity to attract matter so that it gets more mass and consequently stronger actual gravity.
What is actually needed is not data-centrism⁵, just decoupling data from applications. Not even that: loose coupling would suffice. It’s also easy to explain. You want to run, but your leg is in plaster. Once the plaster is removed you can bend your knee and ankle, you can run, jump, you can walk in one direction and then abruptly change it. But no, loose coupling is the language of SOA; it’s passé⁶. Nobody would listen. Not that there are many who hear the data-centric cries, but at least they stand a chance of being echoed off the more modern knowledge graphs and F.A.I.R principles.
It took a bit of explanation to cover data-centric. But it was needed, so I can now try to explain why it doesn’t go well with project and even less so with requirements.
What’s wrong with “project”?
Application-centrism is deep in software engineering, project management, and corporate culture. Take for example the traditional CIO title. Chief Information Officer, the officer responsible for ensuring good management of information. How many do you know that actually do that? They are rather chief application officers, or technology officers or infrastructure officers, anything around information but not the information itself.
Another example is every IT project. It is centred on a certain solution. It is about solving a particular problem, based on collected historical requirements. The application focus of most IT projects ignores space and time. Initially, space (the enterprise) was not represented in the project team at all. This brought Enterprise Architecture to the rescue. But it created its own world of artefacts, often equally far from both business and IT. Time (the future) beyond the project period is not represented, as if it will never come. This brought first SOA and later DevOps to the rescue. But DevOps were quickly put in the cloud trenches. Yet, the best that DevOps can do is ensure smooth deployment, operations, and scalability. What they can’t do is change the core application architecture in response to changes in needs or legislation. For that, a new project will be needed. And it will bring the same kind of problems. IT projects tend to create local optima and short-term solutions.
If you ask about “data-centric project requirements”, it assumes that data-centrism can be achieved within a project. Well, since it tries to deal with the mess created by project thinking, would it be really successful if managed in the same way? “We can’t solve problems by using the same kind of thinking we used when we created them”⁷. Not that projects are a bad way of managing the creation of something. But as it was borrowed from the construction business, it came with the full package. Now we talk about “building software”. This is not an innocent metaphor. Even if we imagine that it is and we take it here, a project might be a good way of organizing the building of a house but wouldn’t do for a city. And the horizon of data-centric, literally and metaphorically, in other words in space and time, is way too big to put into a single project. It needs something more strategic, a whole rethinking of the way IT is managed and governed.
What’s wrong with “requirements”?
Requirements are central in software engineering. There are well-established methodologies for eliciting, documenting, and tracking project requirements. There are also methodologies for estimating efforts based on requirements. Requirements allow for a split of responsibilities and provide acceptance criteria. A whole profession, business analyst, was born and relies on the market demand for requirements. When it was realised that it’s not easy to fix things at the beginning of the project (people need to learn what they really want and what is feasible), some attempts appeared to allow flexibility. First, it was UP, then RUP, the version of Rational (later IBM), and that later evolved into DAD (Disciplined Agile Delivery, which I assumed long dead but just learned is still practised today). Of course, the Agile movement tried to remove the word requirement and replace it with the much friendlier⁸ user story. And yet both traditional and Agile software engineering focus on building applications, the features of which correspond to requirements or user stories.
Requirements are viral. They reach even outside software engineering, in disciplines trying to bring coherence to organizations. Take TOGAF, the most popular Enterprise Architecture framework. What you see in the center of its methodology is “Requirements Management”.
But it is due to requirement engineering, among other things, that we ended up with silos. Requirements, and especially functional requirements, determine the application boundaries. The fragmentation is not logical, only historical. The way database designers decided what tables to create, with what columns, and how to associate them corresponds, if the project was successful, to what was required (and how it was interpreted by the analysts and the person in charge of the database design). And all concrete requirements (mainly functional but even some non-functional) share the same characteristic:
They are all historical.
Requirements are about what is known, from the history of the business case, and by the people involved, by the time they are captured. You have a set of requirements, and you meet them by building an application. Then you have another set, and you build another application. Then it turns out they need to be integrated. So you capture the integration requirements and build an interface. Which is, by the way, yet another application. Then the interfaces become many, so you build a data warehouse, data lake, virtualization layer or some fancier data integration architecture which in the end is just another silo.
All this is what data-centrism tries to deal with and avoid in the future. Corporate data architecture should not be shaped by known requirements⁹. It should be designed for unknown requirements. In the end, if some requirements for a data-centric architecture are needed, these are only non-functional requirements, the -ilities, like interoperability and scalability. But most of them are not project-specific and are already in the principles, like F.A.I.R, EKGF principles, and, of course, the data-centric principles.
(First published on StrategicStructures.com)
[1]: Check out all five data-centric principles.
[2]: After the publication of this article, the scope of the manifesto was adapted, and now it includes the other two scales.
[3]: This has many facets. Facebook can be looked at as a very successful aggregator or as a prime example of a new form of capitalism.
[4]: The centralization of the web is not only about the content but also about the infrastructure. The convenience of the cloud increased the dependency of both individual users and companies on the strategy and fate of a few powerful providers, namely Amazon, Microsoft, and Google.
[5]: It’s way better to put people in the center, but wait, wasn’t that what brought about climate change? It seems that, just like centralization, any kind of centrism does not solve problems; it just changes the nature of the problems we deal with.
[6]: Microservices could only become fashionable because the original SOA ideas had been forgotten. Those ideas were exactly about microservices; only SOA in practice was not.
[7]: This quote is attributed to Einstein although he never said it, but do we always need a famous person, any authority for that matter, to make something we like and believe in, weigh more?
[8]: It’s friendlier in the way it sounds, in the way it’s managed and in the fact that it focuses on the experience of the user and not on the functionality of the tool.
[9]: This doesn’t mean that functional requirements and user stories do not have utility. In a world where data and applications are decoupled, they will not do any harm.