Data Flow

Data Flow is a template for understanding and designing the sequence of data movement in a Machine Learning system.

Data is used by Machine Learning functional group experts as shown below:

Data Flow Layers

Data passes through layers of processing as it is stored, refined, and prepared for use in Machine Learning Models and Applications.

Sources

Data sources include:

  • Company Internal Databases

  • Company Internal Files

  • Websites

  • Public Data

  • Smartphone Apps

  • IoT Devices

  • Commercial Data Aggregators

  • Point of Sale

  • Corporate Internal Processes

  • Social Media

  • Data Streams

Capture

Capture mechanisms include:

  • Website Scraping

  • Website and Smartphone Chat Dialogues

  • Website and Smartphone Form Submissions

  • IoT Device Interfaces

  • Commercial Data Aggregator Feeds

  • Corporate Internal Process Feeds

Pipeline

Pipeline processes include:

  • Data Ingestion

  • Data Temporary Storage

  • Data Subscription

  • Data Publication
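
The four pipeline processes can be sketched with a small in-memory publish/subscribe buffer. This is only an illustration: the class name MiniPipeline, its method names, and the topic name pos_sales are hypothetical, and a production pipeline would normally rely on a dedicated message broker rather than Python lists.

```python
from collections import defaultdict, deque


class MiniPipeline:
    """Toy pipeline: ingest -> temporary storage -> publish to subscribers."""

    def __init__(self):
        self.buffers = defaultdict(deque)     # temporary storage, one queue per topic
        self.subscribers = defaultdict(list)  # registered callbacks, per topic

    def subscribe(self, topic, callback):
        """Data Subscription: register a consumer for a topic."""
        self.subscribers[topic].append(callback)

    def ingest(self, topic, record):
        """Data Ingestion: accept a record and hold it in temporary storage."""
        self.buffers[topic].append(record)

    def publish(self, topic):
        """Data Publication: drain the buffer, delivering each record to every subscriber."""
        delivered = 0
        buffer = self.buffers[topic]
        while buffer:
            record = buffer.popleft()
            for callback in self.subscribers[topic]:
                callback(record)
            delivered += 1
        return delivered


pipeline = MiniPipeline()
received = []
pipeline.subscribe("pos_sales", received.append)
pipeline.ingest("pos_sales", {"sku": "A1", "qty": 2})
pipeline.ingest("pos_sales", {"sku": "B7", "qty": 1})
delivered_count = pipeline.publish("pos_sales")
```

Separating ingest from publish means a slow consumer does not block capture; records simply wait in temporary storage until the next publish call.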

Databases

Databases include:

ETLs

ETLs include:

  • Extract Functions: pulling data from selected sources

  • Transform Functions: normalization, regularization, aggregation

  • Load Functions: saving data in formats for use in modeling processes

Models

Model type category examples include:

Applications

Application examples include:

  • Medical Diagnosis

  • Autonomous Vehicles

  • Chatbot Dialog

  • Image Recognition

  • Face Recognition

  • Product Recommendations

  • Churn Prediction

  • Malware Detection

  • Search Refinement

Functional Groups

Functional Groups are those organizations and clusters of professionals that participate in Machine Learning.

Functional Groups are discussed here.

Key Factors

Flow Continuity

Efficient and accurate Machine Learning processes require a data flow that is continuous and well managed. Reasons for this include:

  • environment change: the world, its population, technology, etc. are in a state of constant change, which must be reflected in the data used for Machine Learning

  • constant testing and evaluation: Machine Learning models and predictions must be continually tested and evaluated to determine when and how to modify/update them to reflect environment changes

  • new applications: Machine Learning is evolving very rapidly and new applications require new data

Historical Data

It’s critical to retain historical records of Machine Learning model data, training, predictions, alerts, etc. in order to:

  • measure accuracy: knowledge of how models perform outside of training and testing is a critical element in performance evaluation

  • detect model degradation: as the environment changes, model performance can decay, requiring model changes and upgrades

  • demonstrate performance: reporting and visualizing performance data is important for justifying the investments needed to maintain Machine Learning growth
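
As a rough illustration of why retained predictions matter, the sketch below computes rolling accuracy over a stored log of (predicted, actual) pairs and flags degradation. The function names, the window size, the baseline, and the tolerance are all hypothetical values chosen for the example.

```python
from collections import deque


def rolling_accuracy(history, window):
    """Accuracy over each full trailing window of (predicted, actual) pairs."""
    recent = deque(maxlen=window)
    scores = []
    for predicted, actual in history:
        recent.append(predicted == actual)
        if len(recent) == window:
            scores.append(sum(recent) / window)
    return scores


def is_degraded(scores, baseline, tolerance):
    """Flag degradation when the latest rolling accuracy falls below baseline - tolerance."""
    return bool(scores) and scores[-1] < baseline - tolerance


# Hypothetical prediction log: early predictions are correct, later ones drift.
history = [(1, 1), (0, 0), (1, 1), (1, 1), (0, 1), (1, 0), (0, 1), (1, 1)]
scores = rolling_accuracy(history, window=4)
degraded = is_degraded(scores, baseline=0.9, tolerance=0.2)
```

Without the stored history there would be nothing to compare the latest window against, which is the practical argument for retaining it.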

Data Storage

Data should be stored in a manner that makes it readily available to the ETL processes that feed model training and prediction:

  • database stores: if possible, data for Machine Learning should be stored in a database for convenient access by ETL processes

  • data normalization: data should be normalized when possible

  • data update: data should be updated as close to real time as possible for use by production models
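
A minimal sketch of these three points, using Python's built-in sqlite3 module. It reads "normalized" in the database-schema sense, which is an assumption, and the table names, column names, and the record_update helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.executescript("""
    CREATE TABLE customers (
        customer_id TEXT PRIMARY KEY,
        region      TEXT NOT NULL
    );
    CREATE TABLE latest_readings (
        customer_id TEXT PRIMARY KEY REFERENCES customers(customer_id),
        spend       REAL NOT NULL,
        updated_at  TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO customers VALUES (?, ?)", ("c1", "west"))


def record_update(conn, customer_id, spend, updated_at):
    """Keep only the freshest value per customer (insert or overwrite)."""
    conn.execute(
        "INSERT OR REPLACE INTO latest_readings VALUES (?, ?, ?)",
        (customer_id, spend, updated_at),
    )


record_update(conn, "c1", 30.0, "2024-01-01T00:00:00")
record_update(conn, "c1", 45.0, "2024-01-02T00:00:00")

# An ETL process can join the normalized tables when it needs both attributes.
rows = conn.execute(
    "SELECT c.region, r.spend FROM latest_readings r "
    "JOIN customers c ON c.customer_id = r.customer_id"
).fetchall()
```

Customer attributes live in one table and the frequently updated values in another, so an update touches a single row and the ETL query always sees the latest value.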

Data Tiering

Implementing a tiered storage strategy provides optimization options for different types of data:

  • Hot data: Kept in high-performance storage for quick access

  • Warm data: Moved to lower-cost storage options

  • Cold data: Archived in very low-cost storage for long-term retention
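
One simple way to assign tiers is by time since last access. The sketch below assumes thresholds of 30 and 365 days and uses made-up dataset names; real policies may also weigh access frequency, size, or compliance rules.

```python
from datetime import date, timedelta

HOT_MAX_AGE = timedelta(days=30)    # assumed threshold, not from the source
WARM_MAX_AGE = timedelta(days=365)  # assumed threshold, not from the source


def assign_tier(last_accessed, today):
    """Return 'hot', 'warm', or 'cold' based on time since last access."""
    age = today - last_accessed
    if age <= HOT_MAX_AGE:
        return "hot"
    if age <= WARM_MAX_AGE:
        return "warm"
    return "cold"


today = date(2024, 6, 30)
# Hypothetical datasets and their last-access dates.
datasets = {
    "daily_features": date(2024, 6, 25),
    "q1_training_set": date(2024, 2, 1),
    "2019_raw_logs": date(2020, 1, 15),
}
tiers = {name: assign_tier(accessed, today) for name, accessed in datasets.items()}
```

Running a rule like this on a schedule is one way to move data between tiers without manual review.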

Automation

Maintaining a rapid and efficient flow of data is not possible without significant process automation because of the:

  • volume of data: model training and performance evaluation require very large volumes of data

  • velocity of change: rapid environmental change translates to rapid data change
