Faster and broader analysis through Machine Learning and Real-Time Analytics
By Christophe Rivoire, UK Country Manager, and Jean-Philippe Rezler, Lead Solutions
- The amount of stored data is going to triple by 2026, and companies need to be able to extract value from their data in real-time.
- Companies are adapting their strategies through machine learning, but the size of the data sets and technical complexities of implementing ML are common challenges.
- Opensee’s solution for Big Data analytics enables real-time access to a large set of data, and helps create and apply machine learning models for faster and more accurate analysis.
The Big Data impact on the world.
How often have you seen Big Data’s anticipated impact on everyday lives compared to the transformational effects of oil during the 20th century? By most estimates, the total amount of stored data is going to triple between 2021 and 2026 and, like oil, it needs to be processed and refined to deliver value and drive change.
Data lies at the heart of any enterprise, helping organisations to make better business decisions. In the Financial Industry, data is also central to all the mandatory regulatory reporting. All these reasons have forced companies to manage the increasing amounts of data they produce in their day-to-day activities; now they are focused on enabling their employees to extract value from their vast quantities of stored data, ideally in real-time and without compromising on the level of granularity required for processing or analysis.
Companies are adapting their strategies through machine learning.
Given the size of the data sets, companies are in parallel implementing automated systems that process data and identify patterns, translating them into valuable insights for enhanced decision-making. These “machine learning” systems adapt over time as they learn from the data fed to them.
In the Financial Industry there are multiple examples of this at work – from the certification process of a risk framework and fraud detection to algorithmic trading and execution.
However, Machine Learning (ML) projects, and more broadly Artificial Intelligence projects, often face upstream and more basic hurdles relating to the underlying data sets. Accessing and understanding those data sets, and managing the related technical complexities, are very much “Big Data” challenges.
An overview of the machine learning development cycle.
The development cycle of ML models is typically presented in three steps:
- Data acquisition and pre-processing
- Model building and training
- Model testing and deployment
After the development stage, we can add another step which concerns the usage and monitoring of the models, and the subsequent sharing of the findings and data. Before we get to this, let’s first consider the challenges and solutions in the various steps of ML development.
Data acquisition and pre-processing
The first step of any ML model is the capacity to receive data in a structured or semi-structured format. This will be used by the models afterwards but, most importantly, it will be used by data scientists or anyone who wants to extract trends. Meaningful information filtered by this “expert” view can then be used to build relevant ML models. Data quality is important in this phase. No technology can magically “create” this quality, so transparency and accessibility are key.
In practice, a real-time Big Data analytical platform should allow users to access, visualise, potentially correct and handle a large set of data quickly and easily. This facilitates and accelerates the creation of meaningful data analysis and preparation for relevant ML models. Data scientists and data experts, as well as “standard users” with business knowledge, can then provide the critical inputs to the models.
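The kind of pre-processing described above can be sketched with Pandas. This is a minimal, generic illustration, not Opensee’s implementation; the column names and the correction policy (median fill, dropping rows with an unknown desk) are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical sample of trade-level data; column names are illustrative.
raw = pd.DataFrame({
    "trade_id": [1, 2, 3, 4],
    "notional": [1_000_000, None, 250_000, 500_000],
    "desk": ["rates", "rates", "fx", None],
})

# A basic quality check a data expert might run before modelling:
# count missing values so they can be inspected, not silently dropped.
missing = raw.isna().sum()

# A simple, transparent correction policy: fill missing notionals with
# the median and drop rows whose desk is unknown.
clean = raw.dropna(subset=["desk"]).copy()
clean["notional"] = clean["notional"].fillna(clean["notional"].median())
```

The point is transparency: each correction is an explicit, reviewable rule rather than a hidden transformation, so business users can validate the data feeding the models.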
Model building and training
ML models need structured and often normalised numerical data, and building them involves a lot of back and forth for testing, hyperparameter optimisation and so on.
Access to structured data in real-time facilitates the journey of model analysis, optimisation and adjustment.
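The back-and-forth of training and hyperparameter optimisation mentioned above can be sketched with scikit-learn. This is an illustrative example on synthetic data, not tied to any Opensee API; the choice of a ridge regression and the grid of regularisation strengths are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Illustrative numerical features; in practice these would come from the
# structured data prepared in the previous step.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hyperparameter optimisation: cross-validated grid search over the
# regularisation strength, the kind of iterative tuning that benefits
# from fast access to structured data.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
model = search.best_estimator_
```

Each grid-search iteration re-reads the training set, which is why real-time access to the structured data materially shortens the optimisation loop.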
Model testing and deployment
Once the model is ready for testing, and then for producing predictions that users can access quickly, it needs to be deployed in an environment where data can be loaded into it efficiently. Deploying the model in the very place where data is stored, and usually processed to produce dashboards, makes computation and the use of predictions much faster and more efficient.
A step-by-step guide to training machine learning models.
When the development cycle is completed and the model is in production, the learning is done, but monitoring of the model remains critical: it can become less accurate over time as production data drifts away from the training data. Maximising the amount of comparable data drawn from various sources is a very efficient way to minimise this well-known risk: it improves the learning process and exposes the model to unusual cases, in turn making it more robust and reducing the risk of drift. This requires an extremely scalable platform able to handle multiple data sources.
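A simple way to monitor the drift described above is to compare the distribution of a production feature against its training distribution. The sketch below uses one common, deliberately simple signal (the shift of the production mean measured in training standard deviations); the thresholds and data are illustrative assumptions, not a prescribed methodology.

```python
import numpy as np

def drift_score(train_feature, prod_feature):
    """Shift of the production mean, in units of the training standard
    deviation. A simple drift signal; alert thresholds are illustrative."""
    mu, sigma = train_feature.mean(), train_feature.std()
    return abs(prod_feature.mean() - mu) / sigma

# Synthetic example: training data, then two batches of production data.
rng = np.random.default_rng(1)
train = rng.normal(loc=0.0, scale=1.0, size=5000)
stable = rng.normal(loc=0.05, scale=1.0, size=1000)   # close to training
drifted = rng.normal(loc=1.0, scale=1.0, size=1000)   # clearly shifted

# In production, a score above some threshold (say 0.5) would flag the
# model for retraining on fresher data.
scores = {"stable": drift_score(train, stable),
          "drifted": drift_score(train, drifted)}
```

Richer tests (e.g. Kolmogorov–Smirnov or population-stability indices) follow the same pattern: periodically score incoming data against the training baseline and retrain when the gap grows.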
Within this context we can now consider how to train an ML model and deploy it in production. At Opensee we approach this by offering fully integrated functionality to build “Calculator” UDFs in Python (Python User Defined Functions). Two “calculators” are needed: the first uses a Pandas DataFrame to train the model and save it, while the second uses the stored model and new input data to carry out the prediction and return the result.
In the first phase of data acquisition and pre-processing, two data tables within the same module need to be created:
- At least one data table: This table will contain the data used to train and test the model. If you want to use data that is already in the Opensee platform to train your model, you will have to extend the module of the table that contains your data.
- A models table: This table will store your trained models for use during the prediction process.
The second phase, training your model, uses Opensee’s calculator and sends back the trained model as the result.
After using the first calculator to train your machine learning model and pushing the model into Opensee, you’ll be able to use it to predict from any other given data. For that, you’ll use the prediction “calculator”, which can easily be called in the UI.
In this second calculator, two queries are performed: one to retrieve the data for the prediction and one to retrieve the trained model, which requires the data and models tables to be in the same module. The calculator then runs the model on the data and sends back predictions that can be plugged into any relevant dashboard.
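The two-calculator pattern can be sketched in plain Python. Opensee’s actual Calculator API is proprietary, so this is only a structural analogy: the function names, the dict standing in for the models table, pickle serialisation and the linear model are all assumptions made for the illustration.

```python
import pickle
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the models table: a dict keyed by model name.
models_table = {}

def train_calculator(df: pd.DataFrame) -> bytes:
    """First calculator: train on a DataFrame and return the serialised
    model, to be stored in the models table."""
    model = LinearRegression().fit(df[["x"]], df["y"])
    return pickle.dumps(model)

def predict_calculator(model_blob: bytes, df: pd.DataFrame) -> pd.Series:
    """Second calculator: load the stored model and return predictions
    that could be plugged into a dashboard."""
    model = pickle.loads(model_blob)
    return pd.Series(model.predict(df[["x"]]), index=df.index)

# Training phase: fit on historical data and store the model.
history = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})
models_table["linear_v1"] = train_calculator(history)

# Prediction phase: retrieve the stored model and score new data.
new_data = pd.DataFrame({"x": [5.0, 6.0]})
preds = predict_calculator(models_table["linear_v1"], new_data)
```

Keeping training and prediction as separate, serialisable steps mirrors the article’s point: the model lives next to the data, and prediction is just another query against stored artefacts.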
Opensee’s solution embraces machine learning for Big Data analytics.
There are many use cases in the Financial Industry where Opensee’s real-time Big Data analytics allow users to perform everything from simple tasks to more complex steps leveraging ML. It enables real-time access to, and visualisation and handling of, large data sets for simple checks, while its Python UDF module helps create and apply highly relevant Machine Learning models. Running this end-to-end process where the data is stored gives data experts and business users alike full confidence in using the models, by significantly reducing the time required to create a model and, more importantly, to use it.
At Opensee, we’re all about helping users create value from Big Data in the 21st century. Embracing the convergence between analytics and machine learning within our solution is a clear example of expanded opportunities which help our clients to analyse faster and smarter.