Extract Functions
Extract functions can include:
Database Queries
File Reading and/or Data Selection
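The two kinds of extract function listed above can be sketched with the Python standard library alone. This is a minimal sketch, not a definitive implementation: the function names `extract_from_database` and `extract_from_file`, the `orders` table, and its columns are hypothetical, and an in-memory SQLite database stands in for a real source system.

```python
import csv
import io
import sqlite3


def extract_from_database(conn, min_amount):
    """Database query: select only the rows the pipeline needs."""
    cursor = conn.execute(
        "SELECT id, customer, amount FROM orders WHERE amount >= ? ORDER BY id",
        (min_amount,),
    )
    columns = [d[0] for d in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def extract_from_file(file_obj, wanted_columns):
    """File reading with data selection: keep only the wanted columns."""
    reader = csv.DictReader(file_obj)
    return [{c: row[c] for c in wanted_columns} for row in reader]


# Hypothetical source data, built in memory so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "ann", 20.0), (2, "bob", 5.0), (3, "cy", 42.5)],
)
db_rows = extract_from_database(conn, min_amount=10.0)

csv_text = "id,customer,region\n1,ann,north\n2,bob,south\n"
file_rows = extract_from_file(io.StringIO(csv_text), ["id", "region"])
```

Values read from a CSV file arrive as strings, so any type conversion is left to a later transform step.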
Transform Functions
Transform functions can include:
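As one illustration, a transform function might clean, convert, and filter extracted rows. The sketch below is an assumption rather than a list taken from this page: the function name `transform_rows` and the `customer` and `amount` fields are hypothetical.

```python
def transform_rows(rows):
    """Normalize text, convert types, and drop rows that cannot be used."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or malformed amount
        cleaned.append({
            "customer": str(row.get("customer", "")).strip().lower(),
            "amount": round(amount, 2),
        })
    return cleaned


# Hypothetical raw rows, as they might arrive from an extract step.
raw = [
    {"customer": "  Ann ", "amount": "20.504"},
    {"customer": "Bob", "amount": "n/a"},
    {"customer": "CY", "amount": 42.5},
]
transformed = transform_rows(raw)
```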
Load Functions
Load functions can include:
Database Loads
File Writes
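Both kinds of load function listed above can be sketched with the standard library. This is a minimal sketch under stated assumptions: `load_to_database`, `load_to_file`, and the `sales` table are hypothetical names, and an in-memory SQLite database and an in-memory text buffer stand in for a real target database and file.

```python
import csv
import io
import sqlite3


def load_to_database(conn, rows):
    """Database load: insert all rows inside one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO sales (customer, amount) VALUES (:customer, :amount)",
            rows,
        )


def load_to_file(file_obj, rows, fieldnames):
    """File write: emit rows as CSV with a header line."""
    writer = csv.DictWriter(file_obj, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


rows = [{"customer": "ann", "amount": 20.5}, {"customer": "cy", "amount": 42.5}]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
load_to_database(conn, rows)
loaded_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

buffer = io.StringIO()
load_to_file(buffer, rows, ["customer", "amount"])
csv_output = buffer.getvalue()
```

Wrapping the insert in a transaction means a failed load leaves the target table unchanged rather than partially filled.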
Tools
ETL processes are often constructed and performed using a custom collection of tools.
The following tools are oriented to ETL processing:
AWS Glue: a managed ETL service from Amazon Web Services
Apache Hadoop: an open source software framework for distributed ETL processing across clusters of computers
Dask: scalable, Pandas-like functionality
MapReduce: a programming model and an associated implementation for ETL processing
Microsoft ETL Tools: various related tools to perform the ETL process
Pandas: a Python library for various types of data manipulation
SQL: statements to process data, usually from relational databases
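To make the MapReduce entry above concrete, the programming model can be sketched in plain Python: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase combines each group. This is a single-process illustration only and does not use Hadoop or any real MapReduce implementation; the function names and sample records are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (key, value) pair per record: (region, amount)."""
    yield record["region"], record["amount"]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


records = [
    {"region": "north", "amount": 10},
    {"region": "south", "amount": 5},
    {"region": "north", "amount": 7},
]
pairs = [pair for record in records for pair in map_phase(record)]
totals = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a distributed setting the map and reduce calls would run on different machines; only the grouping by key ties them together.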
Key Factors
Key factors include:
Process Automation
Process automation is necessary to provide:
real-time prediction input data: performing real-time predictions requires real-time input data
timely model training input data: model training is an iterative process requiring a series of ETL runs
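One way to picture process automation is a small runner that chains the three steps and repeats them on an interval. The sketch below is a simplified assumption: `run_etl` and `run_on_schedule` are hypothetical names, and a production setup would more likely rely on a scheduler or workflow tool than on a loop like this.

```python
import time


def run_etl(extract, transform, load):
    """Run one ETL pass and return the number of rows loaded."""
    rows = transform(extract())
    load(rows)
    return len(rows)


def run_on_schedule(extract, transform, load, runs, interval_seconds):
    """Repeat the ETL pass a fixed number of times, pausing between runs."""
    counts = []
    for i in range(runs):
        counts.append(run_etl(extract, transform, load))
        if i < runs - 1:
            time.sleep(interval_seconds)
    return counts


# Hypothetical stand-ins for real extract, transform, and load functions.
warehouse = []
counts = run_on_schedule(
    extract=lambda: [1, 2, 3],
    transform=lambda rows: [r * 10 for r in rows if r > 1],
    load=warehouse.extend,
    runs=2,
    interval_seconds=0,
)
```

Passing the three steps in as functions keeps the runner independent of any particular source or target, so the same loop can feed both training and prediction.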
Comprehensive Data Access
ETL processes should provide comprehensive data access for:
effective model training, prediction, and evaluation: data is the lifeblood of Machine Learning
research for new applications: Machine Learning is evolving rapidly, which requires access to new data and data sources
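A simple way to keep data access comprehensive as new sources appear is to register every source behind one lookup. The sketch below is only an assumed design; `register_source`, `fetch`, and the source names are hypothetical.

```python
SOURCES = {}


def register_source(name):
    """Decorator that records an extract function under a source name."""
    def decorator(func):
        SOURCES[name] = func
        return func
    return decorator


def fetch(name):
    """Look up a registered source and return its data."""
    if name not in SOURCES:
        raise KeyError(f"unknown data source: {name}")
    return SOURCES[name]()


@register_source("training_labels")
def _labels():
    return [0, 1, 1]  # hypothetical sample data


@register_source("sensor_feed")
def _sensor_feed():
    return [0.2, 0.4]  # hypothetical sample data


available = sorted(SOURCES)
labels = fetch("training_labels")
```

Adding a new data source then means registering one more function, without changing the code that consumes the data.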