News
27 April 2022
Big Data: Cost, Storage and Developers
The biggest challenge is finding a balance between three aspects: current tasks that ensure business value of the data, prospective tasks and the cost of storage and processing.
In an era of mass tracking, cold sales calls and spam-call identifiers (for example, from Tinkoff or Yandex), everyone has heard of big data. Let us look at its most pressing issues in detail.
The etymology of «big data»
The term «big data» can be interpreted in different ways depending on the purpose. Some sources use it for a colossal amount of unstructured data, while others mean the tools and approaches applied to that data. I, for one, tend to think of it as nothing more than a large volume of data in an arbitrary form.
Some may ask «how large?» or, as a colleague of mine used to say: «How much data does there have to be to make it big?» My rule of thumb is that once the volume exceeds several terabytes, it really is a lot of data, and it is time to think about tools for storing and processing it intelligently. Strictly speaking, a composite indicator is needed here, one that accounts for data type, volume and transmission speed, as well as the complexity of the processing operations. For simplicity, however, we can consider a single value: the volume.
Where does big data «live»? How much does it cost?
These two questions are closely related, so I suggest we begin with the costs. There are many storage options, but optimization and the eventual choice are often driven by cost. Essentially, the cost of data management depends on:
— data volume and type;
— tasks to be solved using the data;
— the required data processing speed;
— variety of the input data;
— cloud availability;
— data maturity of the company.
Note that technology and everything related to it is not on this list, because the choice of technology follows from the factors above.
First and most obviously, volume: with the same technology and all other conditions equal, storing 10 TB is cheaper than storing 100 TB.
To my mind, tasks are the main cost driver, and this cost has to be offset by the profit the business makes from using the data. If the data consumer is an analytical department that needs quality data marts, or ML models that require complex aggregates, data processing is many times more complex than when you only need cold archive storage. Processing complexity, in turn, requires processing capacity and memory. If the customer requires «five 9s» of availability, expect a further cost multiplier. On top of that, the more complex the ETL system required, the more expensive it is to build and maintain.
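To see why «five 9s» acts as a cost multiplier, it helps to translate availability targets into a yearly downtime budget. The sketch below is plain arithmetic, assuming a 365-day year; the function name is illustrative and not taken from any library.

```python
# Yearly downtime budget for an availability target expressed in "nines".
# Assumption: a non-leap year of 365 days.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Return the allowed downtime per year, in minutes, for N nines."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in (2, 3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

Each extra nine cuts the allowed downtime by a factor of ten. Five nines leaves a little over five minutes per year, which is why such a requirement usually shows up in the budget.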
Next, the required data processing speed. Streaming data to a dashboard that refreshes once a minute, or to an antifraud model that detects fraud in real time, is much more expensive than maintaining data marts that can be updated once a week at the weekend, or a cold archive into which data is loaded once a month. Basically, speed raises the requirements for processing capacity and, even more so, memory, both of which are expensive (especially recently). It also demands a more technologically sophisticated solution, which increases the cost drastically as well.
Variety of input data affects the technological complexity of the solution and therefore brings additional expenses for licenses, higher requirements for specialists and more support effort for the solution in general. When you need to work with data of different types, for example relational tables, graph data and documents, the technological complexity of the solution will clearly be much higher than when you build, once again, a good (or maybe not so good) old cold archive from several fully relational sources.
Using the cloud for data storage can be very tempting: you pay as you go, you shift capital costs to operating costs, and you do not have to think about inventory (as you would when buying the system), space, electricity and many other details. However, cloud storage is not suitable for certain types of data (regulatory requirements must be respected) and not applicable to certain tasks. Generally, the cloud is fine, but it takes a very competent architect to know where to draw the line between cloud and in-house storage.
So, when selecting a tech stack and a storage architecture, one normally pays attention to these aspects, because they determine the basic cost of data storage and processing. To evaluate the final cost of storage and processing, one more factor, not yet discussed, has to be considered: the data maturity of the company.
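The basic-cost factors discussed in this section can be combined into a rough back-of-the-envelope estimate. The sketch below is only an illustration: the multiplicative form, the `Workload` class, the base price and every multiplier value are hypothetical assumptions, not figures from this article or from any vendor.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    volume_tb: float           # data volume in terabytes
    task_multiplier: float     # 1.0 = cold archive; higher for data marts or ML aggregates
    speed_multiplier: float    # 1.0 = monthly batch load; higher for streaming or real time
    variety_multiplier: float  # 1.0 = fully relational sources; higher for mixed data types


def monthly_cost(w: Workload, base_cost_per_tb: float) -> float:
    """Rough monthly cost, assuming the factors simply multiply (a simplification)."""
    return (w.volume_tb * base_cost_per_tb
            * w.task_multiplier * w.speed_multiplier * w.variety_multiplier)


BASE_COST_PER_TB = 20.0  # hypothetical price in arbitrary currency units

archive = Workload(volume_tb=100, task_multiplier=1.0,
                   speed_multiplier=1.0, variety_multiplier=1.0)
antifraud = Workload(volume_tb=10, task_multiplier=3.0,
                     speed_multiplier=5.0, variety_multiplier=1.5)

print("cold archive, 100 TB:", monthly_cost(archive, BASE_COST_PER_TB))
print("real-time antifraud, 10 TB:", monthly_cost(antifraud, BASE_COST_PER_TB))
```

With these made-up numbers, the 10 TB real-time workload comes out more expensive than the 100 TB archive, which illustrates the point that tasks and speed, not volume alone, drive the cost.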
Why is data maturity so important?
Many people know, or have at least heard, of maturity levels for different processes. In the IT industry, for example, they are well described in ITIL. Similar, albeit less formalized, levels of maturity exist in data management as well. I will not present them here, since there is no unified approach to them today, and describing something that has not yet become even a de facto standard would take a dedicated discussion or paper. Therefore, I suggest that we equate data maturity levels with data governance maturity levels. If you are not aware of data management and have no practices in place, we shall assume you are at Level 0 or Level 1. In effect, this means that at this «level of maturity» data maturity does not affect the cost of storage and processing at all.
However, once you start implementing data governance practices, your maturity level grows and, with it, your data management expenses: first for developing and implementing those practices, then for supporting and maintaining them. It does not much matter whether you begin with a role model of data owners, a data steward function, creating and maintaining a business glossary, data quality assurance or other activities. As you may imagine, the growth is not linear but closer to exponential: each successive level costs more than the previous one.
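A small sketch makes the shape of this growth concrete. It assumes, purely for illustration, that each step up in maturity costs a fixed multiple of the previous step; the function names, the unit cost of the first step and the growth factor of 2 are hypothetical, and Levels 0 and 1 are treated as free, in line with the assumption above.

```python
def step_cost(level: int, first_step_cost: float, growth: float) -> float:
    """Cost of moving from `level - 1` to `level`, assuming geometric growth."""
    if level < 2:
        return 0.0  # Levels 0 and 1: no governance practices, so no extra cost
    return first_step_cost * growth ** (level - 2)


def cumulative_cost(target_level: int, first_step_cost: float, growth: float) -> float:
    """Total cost of climbing from Level 1 up to `target_level`."""
    return sum(step_cost(level, first_step_cost, growth)
               for level in range(2, target_level + 1))


# Hypothetical inputs: the first step costs 1 unit, each next step costs twice as much.
for level in range(1, 6):
    print(level, step_cost(level, 1.0, 2.0), cumulative_cost(level, 1.0, 2.0))
```

Under these assumptions, reaching Level 5 costs 15 units in total, and more than half of that is spent on the last step alone.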
One may think that data governance practices are no good, nothing but unnecessary overhead. Of course, that is not true. They improve the quality, availability and manageability of data. As a result, the quality of the data product that brings value to the business grows and, therefore, the data product becomes more valuable. Alternatively, new data products can be created, and new business value obtained, faster. Thus, for the level of maturity too, it is important to find a balance between cost and value that suits both sides.
After all, there is obviously no point in ramping maturity up to Level 5 if the business does not understand how to capitalize on this clean, fast and thoroughly documented data. A colleague of mine used to say: «Every attribute in your data mart costs a specific amount of money. Show me what effect you get out of each of them.» That is the right question, although it can sometimes be extremely difficult to answer.
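The colleague's question can be turned into a simple check. The sketch below assumes, as a simplification, that the cost of a data mart is split evenly across its attributes and that the business can put a number on the effect of each one; the attribute names and all amounts are invented for illustration.

```python
def net_effect_per_attribute(mart_cost: float, effects: dict) -> dict:
    """Effect minus allocated cost for each attribute (even cost split assumed)."""
    cost_per_attribute = mart_cost / len(effects)
    return {name: effect - cost_per_attribute for name, effect in effects.items()}


# Hypothetical data mart: total cost 300 units, three attributes.
effects = {
    "customer_segment": 500.0,   # effect the business attributes to this field
    "last_login_device": 50.0,
    "legacy_flag": 0.0,          # nobody uses it
}

net = net_effect_per_attribute(300.0, effects)
unjustified = [name for name, value in net.items() if value < 0]

print(net)
print("attributes that do not pay for themselves:", unjustified)
```

In practice the hard part is not the arithmetic but obtaining the effect figures, which is exactly why the question is difficult to answer.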
Who is in charge of working with big data?
Considering the above, the biggest challenge in selecting a solution is finding a balance between three aspects: current tasks that ensure the business value of the data, prospective tasks, and the cost of storage and processing.
Maintaining this balance is, contrary to popular belief, the responsibility of several specialists, not of the data director alone. Business analysts and ML practice leaders, that is, the people who translate data into the recommendations, forecasts, dashboards and other data presentation formats the business needs, are in charge of the value, both current and, mainly, prospective. In an ideal scenario, they set tasks for the data office concerning new data marts, quality improvements and faster data processing, and back these requests with the expected (or actual) economic impact.
And the tasks of the data office, in turn, are as follows:
— Understanding the business problem or requirements that are imposed on the data;
— Finding optimal answers to the questions presented at the beginning of the article, that is, minimizing the cost of the solution at the required level of quality;
— Ensuring the right balanced level of data maturity in the company.
Still, this is the ideal scenario, in which there is a competent analytical service and the business understands what data is and how to work with it. In practice, the CDO often has to approach business customers and explain what additional benefits can be obtained from the data now, and what becomes possible if they change the technology platform, implement data governance, shift to real-time data processing and so on. Since these are mostly non-core duties of the CDO and the business is not ready, the result is not always what we expect.
ICL Group specialists are well aware of how business value is created from data and how to work with data at all levels: from storage organization and warehouse selection and implementation, to analytics and machine learning, where the core value is born from data, and up to the level of the business, which must understand, accept and capitalize on that value. Our team approaches big data through consulting: from understanding the business problem and the potential business value, through pilots and MVPs that test hypotheses, to selecting solutions (not just storage, but full-stack solutions, from storage to analytics and ML), implementing them and scaling them further. Thanks to this approach, our customers get not just a quality technology solution, but one that meets business needs and improves enterprise performance.