From Moscow to Wall Street: the remarkable journey of ClickHouse
By Onur Cetin, Senior Sales Executive (UK)
If you come from a financial services background, ecosystems, communities and open source are probably exotic ideas that you discovered later in your career. People used to guard their technologies closely: nothing was published or openly discussed, and bankers in general wouldn’t really know what others were doing. But this siloed culture is changing, and we are all gradually discovering the benefits and joys of collaboration, not least thanks to technology.
In data management, one open-source database technology is currently the talk of the town: ClickHouse.
ClickHouse Inc. was spun out of Yandex, Russia’s leading search engine, only a few years ago. You might have seen the news in October that it raised $250 million (after a first $50 million a few weeks before), giving it a $2 billion valuation. The company incorporated in California, after which most of the team moved to Amsterdam; active users include Uber and eBay. The consecutive Series A and B funding rounds were led by some of the largest and best-recognised venture capital firms. While this might sound like business as usual in the crowded data market, the story so far is remarkable.
ClickHouse was designed to scale both vertically and horizontally and to handle real-time queries on massive datasets, a natural requirement for a search engine. It is now capable of processing 100+ petabytes of data, with more than 100 billion records inserted every day and hundreds of megabytes of data per server, per second. Benchmarks show a 100-1000X speed improvement compared with traditional approaches.
So here we have a column-oriented database, like Vertica, Amazon Redshift and Sybase IQ, to name but a few. But it has many ‘new’ tricks that pay attention to low-level details, so a challenging task like real-time processing becomes second nature. A good example is avoiding generic implementations. Consider the hash table, a key data structure for GROUP BY. ClickHouse automatically chooses one of 30+ hash table variations for each specific query, a far more effective approach than a generic choice. More importantly for the financial industry, ClickHouse natively supports arrays, which is critical for real-time analytics.
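To make the hash table point concrete, here is a toy Python sketch of the general idea: inspect the grouping keys and pick a specialised aggregation routine instead of always using one generic hash table. It is only an illustration under simple assumptions, not ClickHouse's actual code, and the function names (group_by_sum and its two helpers) are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def sum_by_small_int_key(keys: Sequence[int], values: Sequence[float]) -> Dict[int, float]:
    """Keys known to lie in 0..255: a flat 256-slot array replaces hashing entirely."""
    sums: List[float] = [0.0] * 256
    seen: List[bool] = [False] * 256
    for key, value in zip(keys, values):
        sums[key] += value
        seen[key] = True
    return {key: sums[key] for key in range(256) if seen[key]}


def sum_by_generic_key(keys: Sequence[Hashable], values: Sequence[float]) -> Dict[Hashable, float]:
    """Fallback for arbitrary keys: an ordinary hash table (a Python dict)."""
    sums: Dict[Hashable, float] = defaultdict(float)
    for key, value in zip(keys, values):
        sums[key] += value
    return dict(sums)


def group_by_sum(keys: Sequence[Hashable], values: Sequence[float]) -> Dict[Hashable, float]:
    """Hypothetical dispatcher: choose a strategy from the shape of the keys."""
    fits_in_one_byte = bool(keys) and all(
        isinstance(key, int) and not isinstance(key, bool) and 0 <= key < 256
        for key in keys
    )
    if fits_in_one_byte:
        return sum_by_small_int_key(keys, values)
    return sum_by_generic_key(keys, values)


if __name__ == "__main__":
    print(group_by_sum([1, 2, 1], [1.0, 2.0, 3.0]))        # small integer keys
    print(group_by_sum(["EUR", "USD", "EUR"], [1.0, 2.0, 3.0]))  # generic keys
```

A real engine makes this choice from column types known at query time rather than by scanning the data, and has far more variants to pick from, but the principle of matching the data structure to the query is the same.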
The bottom line is that ClickHouse enables you to use more, or even all, of your data, irrespective of your industry. It is simply the definition of a future-proof technology.
As one of the earliest users of ClickHouse, we at Opensee see this as more of an old news story; we are constantly searching for new technologies to optimise real-time analytics at scale in financial services.
Thousands of users around the world made this possible, and we carried the flag for financial services. It is pleasing to see efficient markets theory playing out before our eyes: a technology with a relentless focus on speed and optimisation cutting through the noise and eventually getting the recognition it deserves.
We discovered ClickHouse while dealing with credit risk cases involving huge files of Monte Carlo simulation results at tier-1 banks. The fact that ClickHouse is a cheaper-to-scale on-disk solution, with performance matching that of the older in-memory solutions and their larger hardware budgets, was already a good start. Then there was the simplicity and efficiency: no need for a dozen pre-aggregating and tiering services (e.g. Druid) and no need to maintain daily, hourly or per-minute tables (e.g. Hadoop, Spark).
Opensee provides financial institutions with a real-time self-service analytics platform. The architecture that we chose combines a back-end written in Scala, because of its capacity to interpret a domain-specific language, distributed computing with Python calculators spawned on multiple servers, and a user interface that gives users an intuitive way to interact with the data. Together, these technologies provide an environment that simplifies access to the data, with fast aggregations, drill-downs and user-defined functions in Python, all while respecting permissions.
For example, we have implemented an abstraction layer which hides the complexity of the underlying physical data model from the end users. Features such as star models, columns of facts or dimensions stored as arrays, and Python-based metrics are exposed via our APIs or UIs through a denormalized (flat), user-oriented data model. Another example is data versioning, which allows users to run what-if scenarios with full traceability.
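As a rough illustration of what such an abstraction layer does, the Python sketch below maps flat, user-facing field names onto a small star model and builds the SQL with only the joins a request needs. It is a simplified sketch, not Opensee's implementation: every table, column and field name here (risk_facts, dim_book, Desk, Exposure and so on) is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PhysicalColumn:
    table: str   # physical table that holds the column
    column: str  # physical column name


# Hypothetical star model: one fact table and the key joining each dimension to it.
FACT_TABLE = "risk_facts"
JOIN_KEYS: Dict[str, str] = {
    "dim_book": "book_id",
    "dim_counterparty": "counterparty_id",
}

# Hypothetical flat model shown to users, mapped to physical columns.
LOGICAL_MODEL: Dict[str, PhysicalColumn] = {
    "Desk": PhysicalColumn("dim_book", "desk"),
    "Counterparty": PhysicalColumn("dim_counterparty", "name"),
    "Exposure": PhysicalColumn(FACT_TABLE, "exposure"),
}


def build_query(group_by: List[str], metric: str) -> str:
    """Translate a flat request (fields to group by, one metric to sum) into SQL."""
    dims = [LOGICAL_MODEL[name] for name in group_by]
    measure = LOGICAL_MODEL[metric]

    select_parts = [f"{d.table}.{d.column} AS {name}" for name, d in zip(group_by, dims)]
    select_parts.append(f"sum({measure.table}.{measure.column}) AS {metric}")

    # Join only the dimension tables this request actually touches, once each.
    joined: List[str] = []
    for d in dims:
        if d.table != FACT_TABLE and d.table not in joined:
            joined.append(d.table)
    joins = "".join(
        f" JOIN {t} ON {FACT_TABLE}.{JOIN_KEYS[t]} = {t}.{JOIN_KEYS[t]}" for t in joined
    )

    group_clause = f" GROUP BY {', '.join(group_by)}" if group_by else ""
    return f"SELECT {', '.join(select_parts)} FROM {FACT_TABLE}{joins}{group_clause}"


if __name__ == "__main__":
    print(build_query(["Desk"], "Exposure"))
```

The user only ever asks for "Exposure by Desk"; which tables exist and how they join stays behind the mapping, so the physical model can change without touching the user-facing names.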
ClickHouse was the perfect engine to power the solution and serve the data. Opensee wraps its complexity in no-code access, without sacrificing performance, while drastically reducing implementation cost and time. Finally, as cloud is not the solution to every problem in the banking world, we had to come up with a creative hybrid cloud/on-premise implementation (for more on this, please see an earlier blog post by my Opensee colleagues).
I don’t need to dwell on how collaboration helps innovation. If you would like to learn more about our ClickHouse journey, please contact us. It really is a good feeling to know that people, perhaps far away, are improving the technology we use during the working day as well as afterwards, when we auction something on eBay or click on the Uber app to get home.