- Cassandra was efficient for write operations; however, it could not meet the OLAP requirements.
- Scaling was becoming difficult due to the increasing volume of data being ingested and stored. The company was also planning to increase the number of vehicles whose telemetry data had to be ingested.
- The variety of data was also increasing, and pre-processing and data transformation were required.
- The business required a long-term storage, governance, and query strategy to gain business insights and improve operational efficiency.
- The client needed an easy-to-use, efficient reporting and query tool for ad-hoc analysis.
- The solution had to be cost-effective to justify the ROI from the data analyzed.
The customer was also looking for a solution that could support ML models to predict shipment arrival times and help plan and schedule subsequent shipments.
The team at Motherson Technology Services analyzed the problem and the associated data, and proposed building a data lake on AWS to handle the scale, volume, and variety of the data. A cost-effective approach using AWS managed services was proposed and implemented. The key features of the implemented solution are:
- The Spark Cassandra Connector from DataStax was used to access data from the Cassandra database via Amazon EMR.
- AWS Lambda was used to launch the EMR cluster and trigger the data-ingestion job (Spot Instances were used for the EMR core nodes to save cost).
- We created Spark jobs on the EMR cluster for a one-time migration of the initial data from a few specific Cassandra tables to Amazon S3 (the ETL step). A daily job was then set up to handle the incremental load.
- Data cleaning, transformation, and conversion to the Parquet format were handled in the Spark job. Because Parquet is a columnar format, it keeps costs low when the data stored in the lake is queried later.
- Data was partitioned on the data timestamp and other attributes. We used a configuration file that allows the partitioning scheme to change dynamically, so if partitions need to be changed in the future, we can do so without modifying the code.
- CloudWatch Events rules were scheduled to trigger a Step Functions state machine, which in turn invoked a Lambda function; the Lambda function launched the EMR cluster to run the Spark jobs for the daily incremental data transfer. (The scheduler was configured for 7 AM, reading the previous day's data from Cassandra.)
- When the incremental data transfer completed, control returned to Step Functions, which invoked Amazon SNS to send an e-mail notification to the user stating whether the job had completed or failed.
- We used AWS Lake Formation to set up and secure the data lake of S3 buckets, providing security and governance.
- An AWS Glue job was used to catalogue the data stored in Amazon S3, after which Amazon Athena queries were created to query and analyze the data.
- Using the data lake, we created various dashboards in Amazon QuickSight that helped the customer gain business insights, improve efficiency, and optimize costs.
- Using the data lake, we also built an ML model to estimate the arrival time of a shipment. We tried algorithms such as Random Forest and XGBoost, and ultimately chose Random Forest as it gave better results.
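The one-time Cassandra-to-S3 migration step above can be sketched as a PySpark job. This is a minimal illustration, assuming the DataStax Spark Cassandra Connector is on the EMR cluster's classpath; the keyspace, table, bucket, and host names are hypothetical placeholders.

```python
def s3_output_path(bucket, table):
    """Build the S3 destination prefix for a migrated table."""
    return "s3://{}/datalake/{}/".format(bucket, table)

def migrate_table(keyspace, table, bucket):
    # Imported inside the function so the module can be read without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cassandra-initial-load")
             # Hypothetical Cassandra host; in practice this comes from configuration.
             .config("spark.cassandra.connection.host", "cassandra.internal")
             .getOrCreate())

    # Read the Cassandra table through the DataStax connector.
    df = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace=keyspace, table=table)
          .load())

    # Basic cleaning before landing the data in the lake.
    df = df.dropDuplicates().na.drop(subset=["event_time"])

    # Write columnar Parquet, partitioned by the event date.
    (df.write
       .mode("overwrite")
       .partitionBy("event_date")
       .parquet(s3_output_path(bucket, table)))
```

On the cluster this would be run with `spark-submit`, once per table selected for the initial load; the daily incremental job follows the same shape with a date filter on the read.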
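The Lambda-launches-EMR step, with Spot Instances for the core nodes, might look like the following boto3 sketch. The cluster name, log path, step script, and instance sizes are assumptions for illustration, not the production values.

```python
def build_instance_groups(core_count=2):
    """EMR instance groups: on-demand master, Spot core nodes to save cost."""
    return [
        {"Name": "Master", "InstanceRole": "MASTER",
         "InstanceType": "m5.xlarge", "InstanceCount": 1,
         "Market": "ON_DEMAND"},
        {"Name": "Core", "InstanceRole": "CORE",
         "InstanceType": "m5.xlarge", "InstanceCount": core_count,
         "Market": "SPOT"},
    ]

def handler(event, context):
    import boto3  # available in the Lambda runtime

    emr = boto3.client("emr")
    response = emr.run_job_flow(
        Name="cassandra-ingestion",
        ReleaseLabel="emr-6.10.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": build_instance_groups(),
            "KeepJobFlowAliveWhenNoSteps": False,  # transient cluster: terminate after the steps
        },
        Steps=[{
            "Name": "incremental-load",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # Hypothetical location of the Spark job script.
                "Args": ["spark-submit",
                         "s3://example-datalake-bucket/jobs/incremental_load.py"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        LogUri="s3://example-datalake-bucket/emr-logs/",
    )
    return {"ClusterId": response["JobFlowId"]}
```

Keeping the cluster transient (terminating when the steps finish) pairs naturally with the Spot core nodes: compute is paid for only while the daily job runs.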
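The configuration-driven partitioning described above can be sketched as a small helper: the partition columns live in a JSON file, so changing the scheme needs no code change. The file name, keys, and column names here are hypothetical.

```python
import json

# Hypothetical default scheme used when no config file is supplied.
DEFAULT_CONFIG = {"partition_columns": ["event_date", "vehicle_type"]}

def load_partition_columns(path=None):
    """Read partition columns from a JSON config; fall back to the defaults."""
    if path is None:
        return list(DEFAULT_CONFIG["partition_columns"])
    with open(path) as f:
        return json.load(f)["partition_columns"]

# In the Spark job the columns are then applied dynamically, e.g.:
#   df.write.partitionBy(*load_partition_columns("etl_config.json")).parquet(dest)
```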
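An ad-hoc Athena query against the Glue-catalogued data could be issued as in this sketch; the database, table, and output-bucket names are illustrative assumptions.

```python
def daily_event_count_query(table, day):
    """Build an example ad-hoc query over a date-partitioned lake table."""
    return ("SELECT COUNT(*) AS events FROM {} "
            "WHERE event_date = DATE '{}'").format(table, day)

def run_query(database, query, output_bucket):
    import boto3

    athena = boto3.client("athena")
    result = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        # Athena writes result files to this S3 location.
        ResultConfiguration={
            "OutputLocation": "s3://{}/athena-results/".format(output_bucket)},
    )
    return result["QueryExecutionId"]
```

Because the table is partitioned on the event date, a query like this scans only the matching partition's Parquet files, which is what keeps per-query cost low.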
Fig 1. Architecture Diagram 1
- AWS DMS
- Debezium Server
- Debezium Connector for Cassandra
- AWS Data Pipeline