7 March 2022
Data intensive computing is a term coined to describe the processing of large volumes of data (usually terabytes or petabytes) that are generated every day and referred to as Big Data. The amount of data generated today is so large that traditional ways of analyzing it are out of the question.
Dealing with big data raises a lot of questions: how to ensure the correctness and completeness of accumulated data, how to scale with an increase in load, and how to manage obsolete legacy systems.
Data intensive computing was created to tackle the problem of very large data volumes by changing how data is managed. The first change is minimizing data movement by allowing algorithms to execute on the node where the data is stored. Another is allowing the runtime system to control the scheduling, execution, load balancing, communication and movement of programs. Moreover, data intensive computing is designed to scale, so it can accommodate growing amounts of data while still meeting time-critical requirements.
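The idea of minimizing data movement can be illustrated with a small sketch. It is only a toy model under stated assumptions: the cluster layout `NODE_DATA`, the node names and both functions are hypothetical, and a real system would run the local step on separate machines rather than in one process. The point it shows is that running the computation where the data lives means only small partial results have to travel.

```python
# Hypothetical cluster layout (an assumption for illustration):
# node name -> records stored on that node.
NODE_DATA = {
    "node-a": [3, 5, 8],
    "node-b": [13, 21],
    "node-c": [34, 55, 89, 144],
}


def local_sum(records):
    """Runs on the node holding the records; returns a small partial result."""
    return sum(records), len(records)


def mean_with_local_execution(node_data):
    """Run the computation on each node and combine only the partial results."""
    partials = [local_sum(records) for records in node_data.values()]
    values_moved = 2 * len(partials)  # each node sends back (sum, count)
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count, values_moved


def mean_with_central_execution(node_data):
    """Baseline: copy every record to one place before computing."""
    gathered = [r for records in node_data.values() for r in records]
    return sum(gathered) / len(gathered), len(gathered)


if __name__ == "__main__":
    print(mean_with_local_execution(NODE_DATA))
    print(mean_with_central_execution(NODE_DATA))
```

Both functions return the same mean, but the first one moves two values per node no matter how many records each node stores, while the second moves every record.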
There are three crucial aspects to the design of a system capable of processing large amounts of data: reliability (performing correctly even in the face of errors and staying safe against unauthorized access), scalability (dealing with growth) and maintainability (designing it in a way that will be adaptable to new users).
Interestingly, data intensive apps are a good example of a situation where deliberately causing a fault can be a good thing. A fault is not the same as a failure: a failure means that the whole system has stopped operating, while a fault is just one component of the system deviating from its expected behavior. Since it’s impossible to reduce the number of faults to zero, the only solution is to build a fault-tolerant system, one that prevents faults from turning into failures. In such systems faults are triggered deliberately to make sure that the machinery reacts appropriately and knows how to deal with them.
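Deliberately triggering faults to check that a system tolerates them can be sketched in a few lines. This is a minimal illustration, not a description of any particular tool: `TransientFault`, `make_flaky` and `call_with_retries` are hypothetical names, and retrying is only one of several possible ways to tolerate a fault.

```python
import random


class TransientFault(Exception):
    """One component misbehaving; not a failure of the whole system."""


def make_flaky(func, fault_rate, rng):
    """Wrap func so it deliberately raises a fault with probability fault_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < fault_rate:
            raise TransientFault("injected fault")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, *args, attempts=5):
    """Tolerate faults by retrying; report a failure only if every attempt faults."""
    faults = 0
    for _ in range(attempts):
        try:
            return func(*args), faults
        except TransientFault:
            faults += 1
    raise RuntimeError(f"failure after {faults} consecutive faults")


def square(x):
    return x * x


if __name__ == "__main__":
    flaky_square = make_flaky(square, fault_rate=0.3, rng=random.Random())
    result, faults_seen = call_with_retries(flaky_square, 7)
    print(f"result={result}, faults tolerated={faults_seen}")
```

The caller still gets a correct answer as long as at least one attempt succeeds, which is the difference between a fault (one attempt going wrong) and a failure (the whole call giving up).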
For instance, it can be predicted that at some point there will be a hardware fault, such as a hard disk crash or a blackout. It is also possible to design a system to cope with it, for example by setting up disks in a RAID configuration or providing dual power supplies. Software faults are more difficult to anticipate. The best approach is to assume that a fault is possible and to limit its impact through constant testing, process isolation and analysis.
Long story short, data intensive computing is used everywhere.
The general increase in the amount of data leads to new technologies, practices and approaches. The volume of data calls for innovation in how it is analyzed and stored. In past decades it was normal for organizations to keep their own storage infrastructure, but now that responsibility is shifting to the cloud operator. Big data is also used in a wide range of processes we witness every day, sometimes without even realizing what kind of big data operation we’re dealing with.
The speed at which new data is generated forces the adoption of advanced analytics, machine learning and AI. Machine learning and AI can spot patterns and anomalies faster than manual analysis, and they can also make predictions that help optimize processes within the organization. Another issue is better data visualization, which is necessary to represent vast amounts of data in an easy-to-understand way. Just having data is not enough if one doesn’t know what it means, how to interpret it and what can be done with it.
Big data operations are not limited to the fanciest, almost futuristic examples, such as autonomous drones or driverless cars. They are also used in healthcare (smart watches that monitor users’ heartbeats provide information on cardiovascular diseases) and in GPS systems (for example, when the system knows where a traffic jam can be expected).
More importantly, however, Big Data Analytics is changing continuously, and companies that don’t want to be left behind have to take that into consideration. The possibilities are limitless, and the one thing to remember is that data is power.