Data Lake and Pipeline Application Built by a Leading Data Science Firm to Store & Transform Data

Digital Transformation
The Challenge

A world-class AI and analytics data science firm needed to create unique, innovative solutions for their clients. To achieve this, their data scientists required two critical items:

  • A robust Data Lake capable of handling terabytes of diverse and complex datasets.
  • A data pipeline application to transform raw customer data, public data, and third-party data into standardized, usable datasets.

The Solution

– Intellects Group’s engineers developed modular Python applications to ingest, cleanse, parse, enrich, and transform raw data. The processed data was stored in AWS S3, with Apache Airflow on AWS orchestrating the data pipeline (ETL). The transformed datasets, structured in the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), an OHDSI standard for healthcare data, then became a commercially offered product for the client.

– To handle the massive datasets, each ranging from 300 GB to 500 GB, our engineers used performance-optimized, vectorized Python with Pandas and other data analytics libraries, which kept processing efficient at that scale.

– The first Cloud-based Data Platform (CDP) was built on Databricks (Apache Spark on AWS). The platform provided on-demand access to transformed data for ad-hoc analysis, hypothesis testing, Exploratory Data Analysis (EDA), derivative dataset generation, and Machine Learning (ML) model development. The solution also enabled multi-cloud interoperability for the Data Lake between the AWS data warehouse and GCP BigQuery. In addition, the platform was integrated with Tableau for visualization and with Immuta as the core data governance layer.
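
To make the modular pipeline idea above more concrete, here is a minimal sketch of how cleanse, parse, enrich, and transform stages might be chained in plain Python. It is an illustration only, not the client's implementation: the input field names (`patient_id`, `sex`, `dob`), the stage functions, and `run_pipeline` are all hypothetical, and only three columns of the OMOP `person` table are shown. The gender concept IDs (8507, 8532, and 0 for no match) follow the standard OMOP vocabulary.

```python
from typing import Callable, Iterable, Iterator

# Standard OMOP gender concept IDs; 0 means "no matching concept".
GENDER_CONCEPTS = {"M": 8507, "F": 8532}

Record = dict
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def cleanse(records: Iterable[Record]) -> Iterator[Record]:
    """Trim whitespace from string values and drop rows without an identifier."""
    for rec in records:
        cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}
        if cleaned.get("patient_id"):
            yield cleaned


def parse(records: Iterable[Record]) -> Iterator[Record]:
    """Extract an integer birth year from an ISO date string (YYYY-MM-DD)."""
    for rec in records:
        yield {**rec, "year_of_birth": int(rec["dob"][:4])}


def enrich(records: Iterable[Record]) -> Iterator[Record]:
    """Map the free-text sex field to an OMOP gender concept ID."""
    for rec in records:
        code = rec.get("sex", "").upper()[:1]
        yield {**rec, "gender_concept_id": GENDER_CONCEPTS.get(code, 0)}


def to_omop_person(records: Iterable[Record]) -> Iterator[Record]:
    """Keep only the (hypothetical subset of) OMOP person columns."""
    for rec in records:
        yield {
            "person_id": int(rec["patient_id"]),
            "gender_concept_id": rec["gender_concept_id"],
            "year_of_birth": rec["year_of_birth"],
        }


def run_pipeline(records: Iterable[Record], stages: list[Stage]) -> list[Record]:
    """Apply each stage in order; stages are lazy generators until collected here."""
    for stage in stages:
        records = stage(records)
    return list(records)


# Made-up sample rows standing in for raw source data.
raw = [
    {"patient_id": " 101 ", "sex": "female", "dob": "1984-03-12"},
    {"patient_id": "", "sex": "M", "dob": "1990-01-01"},  # dropped: no identifier
    {"patient_id": "102", "sex": "unknown", "dob": "1975-11-30"},
]
person_rows = run_pipeline(raw, [cleanse, parse, enrich, to_omop_person])
```

Because each stage is an independent function with the same signature, stages can be tested, reordered, or replaced individually. In an orchestrated setup, each stage could plausibly map to its own task, though the sketch makes no assumption about how the actual Airflow jobs were organized.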

The Result

Intellects Group proved to be the ideal partner for this cutting-edge AI and analytics firm. We complemented their expertise, working as equals to deliver a solution that met the strict requirements of a hedge-fund-backed AI startup operating in a fast-growing, high-value data science market.
