Extract Transform & Love
Why ETL?
In today's fast-paced digital world, organizations rely heavily on data to make informed business decisions. Extract, Transform, and Load (ETL) processes play a crucial role in collecting, processing, and storing that data. As the sole manager of over 500 running integrations, I provided clients with technical documentation and precise steps to follow, and built a monitoring system available internally to any account manager or customer support rep. In this blog post, I will discuss my contributions to the ETL process: the redesign of the integration approach, the creation of a secure storage location for client data, and the addition of a metrics service for easy progress tracking.
Refactoring old systems & setting new standards
Originally, the integration approach was to map each school's database 1:1 to our data source in order to feed the data into specific fields of a MySQL database. This approach worked, but an integration typically took around three months to complete. After conducting market research and holding client sessions, we found that most higher education systems can export data automatically to CSV. I therefore led a redesign of the integration approach focused on minimizing the time and effort required from both parties.
After examining which data formats were most commonly supported by higher education systems, the decision was made to standardize on CSV. This streamlined the integration process and allowed clients to easily export their data to the application. A secure storage location was also required; to minimize initial setup and configuration time, SFTP was provided via AWS Transfer Family.
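As a rough illustration of how the backend can see what clients have dropped off, here is a minimal Ruby sketch that lists CSV uploads in the S3 bucket backing the Transfer Family SFTP endpoint. The bucket name, prefix, and region are hypothetical placeholders, not the production values.

# Minimal sketch: list new CSV drops in the S3 bucket behind the SFTP
# endpoint, using the aws-sdk-s3 gem. Names below are illustrative only.
require "aws-sdk-s3"

BUCKET = "example-client-integrations" # hypothetical bucket name
PREFIX = "inbound/"                    # hypothetical per-client upload prefix

s3 = Aws::S3::Client.new(region: "us-east-1")

resp = s3.list_objects_v2(bucket: BUCKET, prefix: PREFIX)
resp.contents
    .select { |obj| obj.key.end_with?(".csv") }
    .each do |obj|
      puts "#{obj.key} (#{obj.size} bytes, uploaded #{obj.last_modified})"
    end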
The internal integration service was created to provide a secure location for clients to store their data, ensuring that data is kept confidential and accessible only to authorized personnel. Technologies used included SFTP and S3 for secure storage and PostgreSQL for database management. The backend service extracts CSV data via an integration run against a Kubernetes resource, triggered by a cron job. The data is pulled from the server location, transformed, and normalized according to the specified environment. Once mapped, the data is loaded into the database and becomes available to the application, where it can be used to generate reports or provide insights. The average integration timeline per client dropped from roughly three months to three weeks.
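To make the extract-transform-load flow concrete, here is a simplified Ruby sketch of one integration run, assuming a hypothetical students.csv file, table, and column names rather than the real schema. It reads the CSV, normalizes each row, and upserts it into PostgreSQL with the pg gem; in production this logic runs inside the Kubernetes resource kicked off by the cron job.

# Minimal sketch of one integration run (file, table, and column names
# are hypothetical): extract a client CSV, normalize each row, and load
# it into PostgreSQL.
require "csv"
require "pg"

conn = PG.connect(dbname: "integrations") # connection details are illustrative

CSV.foreach("students.csv", headers: true) do |row|
  # Transform: trim whitespace and normalize casing before loading.
  email      = row["email"]&.strip&.downcase
  first_name = row["first_name"]&.strip
  last_name  = row["last_name"]&.strip
  next if email.nil? || email.empty?

  # Load: upsert so re-running an integration stays idempotent.
  conn.exec_params(<<~SQL, [email, first_name, last_name])
    INSERT INTO students (email, first_name, last_name)
    VALUES ($1, $2, $3)
    ON CONFLICT (email) DO UPDATE
      SET first_name = EXCLUDED.first_name,
          last_name  = EXCLUDED.last_name
  SQL
end

conn.close

Upserting on a stable key keeps repeated runs safe, which matters when a cron schedule re-processes the same client files.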
Monitoring needs to be easy
To enable easy progress tracking, a metrics service was added, surfaced as a front-end list view that logs what happens during each integration run so the team can track progress and make any necessary adjustments. Technologies used included Ruby on Rails as the framework for the integration service and SQLite for manipulating and managing the metrics data. The metrics service provides real-time feedback on the performance of the ETL process: it tracks the number of successful and failed integrations, the number of records processed, the number of records updated, and the time taken to complete each integration. The data collected is used to identify areas for improvement and adjust the ETL process accordingly.
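As a sketch of how those per-run metrics could be captured inside the Rails app, the snippet below wraps an integration run and records its outcome with ActiveRecord. The model name, column names, and helper method are hypothetical, not the actual implementation.

# Minimal sketch (model and column names are hypothetical): record the
# outcome of an integration run so the front-end list view can display it.
class IntegrationRunMetric < ApplicationRecord
  # assumed columns: client_name, status, records_processed,
  # records_updated, duration_seconds, started_at
end

def run_integration_with_metrics(client_name)
  started_at = Time.now
  processed  = 0
  updated    = 0

  # ... extract, transform, and load, incrementing the counters ...

  IntegrationRunMetric.create!(
    client_name:       client_name,
    status:            "success",
    records_processed: processed,
    records_updated:   updated,
    duration_seconds:  Time.now - started_at,
    started_at:        started_at
  )
rescue StandardError => e
  # Failed runs are recorded too, so the list view shows both outcomes.
  IntegrationRunMetric.create!(
    client_name:      client_name,
    status:           "failed",
    duration_seconds: Time.now - started_at,
    started_at:       started_at
  )
  raise e
end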