Cassandra to ML Modelling – The Challenges and Approach
Machine Learning is transforming the way businesses chart their path forward. Organizations across industry verticals are leveraging the technology to accelerate decision-making, mitigate risks and enhance customer experience. However, Machine Learning requires an appropriate underlying technology framework, such as a Data Lake on AWS, which breaks down silos and scales up or down easily. This blog covers a use case in which one of our customers implemented a Data Lake on the AWS cloud and used it to build predictive ML models.

The customer uses self-owned trucks to transport finished vehicles for OEMs in India. These trucks are modern and technology-enabled, fitted with sensors such as GPS, fuel, engine telematics, door, FASTag, and tyre pressure and temperature sensors. This data was stored in a Cassandra database. The customer was facing the following challenges:
  • Cassandra was efficient at write operations; however, it could not meet the OLAP requirements.
  • Scaling was becoming difficult due to the increasing volume of data being ingested and stored, and the company was also planning to increase the number of vehicles whose telemetry data would be ingested.
  • The variety of data was also increasing, and pre-processing and data transformation were required.
  • The business required a long-term storage, governance and query strategy to gain business insights and improve operational efficiency.
  • The client needed an easy and efficient reporting and query tool for ad-hoc analysis.
  • The solution had to be cost-effective to justify the ROI from the data analyzed.
Use Case Overview

The customer was looking for a solution that could create ML models to predict shipment arrival times and help plan and schedule other shipments. They wanted a cost-effective solution to justify the ROI from the analyzed data.

The team at Motherson Technology Services analyzed the problem and the associated data and identified a data lake solution on AWS to handle the scale, volume and variety of the data. A cost-effective approach using AWS managed services was proposed and implemented. The key features of the implemented solution are:

  • The Spark Cassandra Connector from DataStax was used to access data from Cassandra using EMR on AWS.
  • Lambda was used to invoke the EMR cluster to trigger the data-ingestion job (Spot instances were used for the core nodes in EMR to save cost).
  • We created Spark jobs on the EMR cluster for a one-time migration of the initial data from a few specific Cassandra tables to Amazon S3 (the ETL step). A daily job was then set up to handle the incremental load.
  • Data cleaning, transformation, and conversion of the data to the Parquet format were handled in the Spark job. Parquet, being columnar storage, proves cost-effective at a later stage when querying the data stored in the data lake.
  • Data was partitioned on the data timestamp and other attributes. We used a configuration file that allows changes to happen dynamically, so that if partitions need to be changed in the future, we can do so without modifying the code.
  • CloudWatch Events were scheduled to trigger a Step Functions workflow, which triggered Lambda, which in turn invoked the EMR cluster to run the Spark jobs for the daily incremental data transfer. (The scheduler was configured for 7 AM and reads the previous day's data from Cassandra.)
  • When Lambda completed the incremental data transfer, control returned to Step Functions, which then invoked SNS to send an e-mail notifying the user whether the job had completed or failed.
  • We used AWS Lake Formation to set up and secure the data lake of S3 buckets, providing security and governance.
  • A Glue job was used to catalogue the data stored in Amazon S3, after which the data was queried and analyzed using Amazon Athena.
  • Using the data lake, we created various dashboards in Amazon QuickSight that helped the customer gain business insights, improve efficiency and optimize costs.
  • Using the data lake, we also created an ML model for estimating the arrival time of a shipment. We tried algorithms such as Random Forest and XGBoost and finally opted for Random Forest as it gave better results.
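The daily incremental job described above can be sketched as follows. This is a minimal illustration, not the production code: the keyspace, table, host and the `event_date` column are assumed names, and the actual job also performed cleaning and transformation steps omitted here.

```python
# Sketch of the daily incremental Spark job: read the previous day's rows
# from Cassandra via the DataStax Spark Cassandra Connector and append
# them to S3 as partitioned Parquet. Names are illustrative assumptions.
from datetime import date, timedelta


def cassandra_read_options(keyspace, table, host):
    """Options passed to the Spark Cassandra Connector data source."""
    return {
        "keyspace": keyspace,
        "table": table,
        "spark.cassandra.connection.host": host,
    }


def run_incremental_load(spark, keyspace, table, host, s3_path, partition_cols):
    """Load yesterday's data and write it as partitioned Parquet."""
    yesterday = (date.today() - timedelta(days=1)).isoformat()
    df = (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(**cassandra_read_options(keyspace, table, host))
        .load()
        .filter(f"event_date = '{yesterday}'")  # hypothetical date column
    )
    # partition_cols comes from a config file, so the partition scheme
    # can change without touching the job code
    df.write.mode("append").partitionBy(*partition_cols).parquet(s3_path)
```

Because the partition columns are passed in rather than hard-coded, they can be driven from the external configuration file mentioned above.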
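The Lambda-to-EMR invocation with Spot core nodes might look roughly like the sketch below. Instance types, role names, the EMR release label and the S3 script path are all assumptions for illustration.

```python
# Hypothetical Lambda handler that launches a transient EMR cluster with
# Spot core nodes and submits the Spark ingestion job as a step.
def core_spot_instance_group(count, instance_type="m5.xlarge"):
    """Core node group requested on the Spot market to reduce EMR cost."""
    return {
        "Name": "Core - Spot",
        "Market": "SPOT",
        "InstanceRole": "CORE",
        "InstanceType": instance_type,
        "InstanceCount": count,
    }


def spark_step(script_s3_path):
    """EMR step that runs the ingestion script via spark-submit."""
    return {
        "Name": "cassandra-to-s3-incremental",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", script_s3_path],
        },
    }


def handler(event, context):
    import boto3  # bundled in the Lambda runtime

    emr = boto3.client("emr")
    resp = emr.run_job_flow(
        Name="daily-cassandra-ingestion",
        ReleaseLabel="emr-6.2.0",            # assumed release
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "Market": "ON_DEMAND",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1,
                },
                core_spot_instance_group(count=2),
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # transient cluster
        },
        Steps=[spark_step("s3://example-bucket/jobs/incremental_load.py")],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return resp["JobFlowId"]
```

Keeping the cluster transient (`KeepJobFlowAliveWhenNoSteps=False`) means compute is paid for only while the daily job runs, which is part of what made the approach cost-effective.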
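An Athena query against the Glue-catalogued data can be submitted programmatically as sketched below. The database, table and column names (`event_date`, `speed_kmph`, `shipment_id`) are hypothetical; the filter on the partition column is what makes the query cheap, since Athena then scans only that day's Parquet files.

```python
# Sketch of submitting a partition-pruned query to Amazon Athena.
def daily_speed_query(database, table, day):
    """Build a partition-pruned SQL query (assumed column names)."""
    return (
        f"SELECT shipment_id, avg(speed_kmph) AS avg_speed "
        f'FROM "{database}"."{table}" '
        f"WHERE event_date = '{day}' "  # prunes to one partition
        f"GROUP BY shipment_id"
    )


def run_query(database, table, day, output_s3):
    import boto3  # available on EMR / Lambda

    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=daily_speed_query(database, table, day),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]
```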
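The arrival-time model can be illustrated with a Random Forest regressor as below. The real feature set and training data are not public, so this sketch uses synthetic stand-in features; it only shows the modelling pattern, not the production model.

```python
# Sketch of an ETA regression model using Random Forest (as the blog's
# final choice). Features and data here are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic features: e.g. route distance, traffic index, hour of day
rng = np.random.default_rng(0)
X = rng.random((500, 3))
# Synthetic target: hours to arrival, loosely driven by the features
y = 8.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0.0, 0.1, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

eta_pred = model.predict(X_test)  # predicted hours to arrival
```

In practice, Random Forest and XGBoost would be evaluated against each other on a held-out set (e.g. by mean absolute error of the predicted ETA) before choosing one.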

Fig 1. Architecture Diagram

Previous Approaches That Posed Challenges
Before arriving at the final solution, we researched multiple approaches, as described below. However, none of them was successful, and we faced challenges with each one. Let's dig deeper into these methodologies and why they failed.
  • AWS Database Migration Service (DMS)
AWS Database Migration Service (DMS) helps set up and manage a replication instance on AWS. In this approach, we explored DMS for the incremental transfer of data and used the AWS Schema Conversion Tool for setup on the database side. However, it could only perform Change Data Capture (CDC) to Amazon DynamoDB and could not carry out incremental data transfer to Amazon S3.
Why Didn’t This Approach Work?
Due to the lack of direct support for Cassandra, we had to drop the DMS approach: Cassandra is not currently available as a source in DMS. Also, CDC-based incremental data transfer is supported only to DynamoDB and not to S3, which was the customer's primary requirement.
  • Debezium Server
Debezium provides a ready-to-use application that streams change events from a source database to messaging infrastructure such as Amazon Kinesis, Google Cloud Pub/Sub or Apache Pulsar. We explored Debezium Server for the incremental data transfer from Cassandra to S3 to create the data lake. Interestingly, we were able to make this work by streaming data from Kinesis to S3 in the data lake.
Why Didn’t This Approach Work?
As per the official Debezium documentation, Debezium Server is incubating, which implies that its semantics, configuration options, etc. may change in future revisions; for this reason, it is not recommended for production use.
  • Debezium Connector for Cassandra
The Cassandra connector can monitor a Cassandra cluster and record all row-level changes. To do this, the connector must be deployed locally on each node of the cluster. Once linked to a Cassandra node, it takes a snapshot of all CDC-enabled tables in all keyspaces. The connector then reads the changes written to the Cassandra commit logs and creates the corresponding insert, update, and delete events. All events for each table are recorded in a separate Kafka topic, where applications and services can consume them.
Why Didn’t This Approach Work?
This approach requires an initial setup on the client side, but the semantics and API configuration keep changing, so we could not risk using it for a production deployment.
  • AWS Data Pipeline
AWS Data Pipeline helps users define automated workflows for the movement and transformation of data. In other words, it enables reliable processing and movement of data between different AWS compute and storage services, as well as on-premises data sources. We implemented an AWS Data Pipeline solution that synchronized all the activities: launching the EMR cluster, running the job flow for the incremental data transfer, the retry mechanism, and sending a notification on a successful run of the job or in case of failure.
Why Didn’t This Approach Work?
As this service is not available in the Mumbai region, where the database is located, it incurred inter-region data transfer costs. To eliminate this cost overhead, we decided not to move forward with this approach and used other services for the same purpose. To overcome the challenges in the approaches discussed above, we used the Spark Cassandra Connector by DataStax to access the customer's data from Cassandra using EMR on AWS.

The Final Note

Overall, in this blog we saw that data partitioning improved query performance, and that correctly sizing the nodes in the EMR cluster is required for data pre-processing. The Parquet format also gave us several benefits, such as less space needed for data storage. We showed how we ingested a one-time 7 GB load and incremental 600 MB loads from Cassandra to S3, creating a data lake in S3 and using Glue for its corresponding schema, and then analyzed the data in S3 by running queries through Athena, successfully building a data lake on AWS from the Cassandra database.

This led to a 90% improvement in analysis and query time and a 20% cost saving from improved driver utilization. There was also a 15% improvement in the on-time shipment of vehicles, which further improved customer satisfaction and reduced delivery penalties. With the help of the machine learning model, the ETA (estimated time of arrival) predicted using the Random Forest algorithm increased overall customer satisfaction, which resulted in many intangible benefits: for instance, it helped build trust in the brand, which eventually leads to customer loyalty.

About the Author:

Dr. Bishan Chauhan – AI / ML Practice Head

Umesh Taneja – Senior Cloud Engineer, AI / ML Practice

Pulkit Garg – Data Scientist, AI / ML Practice

