Data for AI: Data Infrastructure for Machine Intelligence, by Scott Burk and Kinshuk Dutta
Artificial Intelligence (AI) is only as powerful as the data that fuels it, and this book is your comprehensive guide to understanding the critical data infrastructure that makes AI work.
Three Pillars of Accelerated AI Adoption
Focus on Data
Modeling Paradigms for DM
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation Phase
Deployment Phase
Data Mining and Supporting Data Technology
Data Preprocessing
Data Catalog Platforms
Data Technology for Visualization, BI, and Exploratory Data Analysis
Data Technology for Dimensionality Reduction
Data Technology Supporting Knowledge Graphs
Data Virtualization for Data Mining
Application Case Highlight – Data Mining to Support Accountable Healthcare
Summary
References
Getting Machine Learning Infrastructure Right
Technical Debt in Machine Learning
Model Dependencies
Data Dependencies
Feedback
Major Types of Machine Learning and Data to Support Them
Supervised Learning
Predicting Loan Default
Unsupervised Learning
Claim Clustering for Property and Casualty Insurance
Semi-Supervised Learning
Property and Casualty Insurance Example with Self-Training
Property and Casualty Insurance Example with Co-Training
Reinforcement Learning
Inventory Management Example
Ensemble Learning
Bagging (Bootstrap Aggregating)
Boosting Methods
Property and Casualty Insurance Example with Boosting
Section Summary
The Rise of Big Data and a Technology Paradigm Shift
Challenges of Traditional ML Architecture
Application Case Highlight – The Next Generation of Medicine – Predicting Asthma
Data Technology was the Barrier for Years
Medical Science and Acquiring the Right Data
Machine Learning for AI Asthma Models
Model Results and Applications
Multiple Applications
Case Management Application
SMS Text Alerts
Emergency Management Reports
Summary
References
The Role of Apache Spark in Enabling Deep Learning
In-Memory Computing and Distributed Computing
Integration with DL Libraries and Real-Time Data Processing
Ease of Use and Community
The Role of Data Lakes in Enabling Deep Learning (2013-2017)
Key Benefits of Data Lakes
Key Data Lake Technologies
Additional Cloud-Based Data Platforms (2013-2017)
Key Cloud-Based Data Platforms
Data Pipelines (2013-2017)
Data Processing Frameworks (2013-2017)
Interoperability in Data Processing and Deep Learning (2013-2017)
Benefits of Interoperability
Impact on Deep Learning
Feature Engineering Platforms (2013-2017)
Key Feature Engineering Platforms
Impact on Deep Learning
ML Lifecycle Platforms (2013-2017)
Key ML Lifecycle Platforms
Impact on Deep Learning
The Evolution of Large Language Models (Post-2017)
The Transition from DL to NLP Breakthroughs
Data Flow in Conversational AI: A Deep Dive
How ChatGPT Processes and Generates Text
APIs and Real-Time Processing in AI Chatbots
The Role of Cloud Computing in Scaling LLMs
Section Summary
Fine-Tuning and Customization of LLMs
Infrastructure Required for Customization
Challenges in Implementing LLMs
Ethical Concerns and AI Bias
Computational Costs and Energy Consumption
Security and Data Privacy in LLM Applications
Future Trends in Conversational AI
More Efficient, Smaller LLMs
AI-Powered Search Engines and Hybrid AI Assistants
Expansion into Multimodal AI (Text + Image + Video)
Summary
References
Generative AI: A Brief Overview
Data: The Lifeblood of Generative AI
Enhancing Large Language Models with Taxonomies and AI-Generated Labels
How Taxonomies Improve LLM Training
AI-Generated Labels for Adaptive Learning
The Future of Structured AI Training
Continuing with Text Datasets in Generative AI
Image Datasets: The Backbone of Visual Generative AI
Annotations: The Key to Image Dataset Utility
Curating and Preprocessing Image Datasets
Challenges in Image Dataset Design
Applications of Image Datasets in Generative AI
Multimodal Datasets: Bridging Modalities for Cross-Functional AI
What Are Multimodal Datasets?
Components of Multimodal Datasets
Curating and Preprocessing Multimodal Datasets
Applications of Multimodal Datasets
Challenges in Multimodal Dataset Development
Case Study – GPT3’s Training Data
Data Preprocessing: Preparing Data for AI
Deduplication: Data Uniqueness and Diversity
Normalization: Achieving Standardization among Different Data Formats
Annotation — Adding Contextual Metadata to Data
Preprocessing Steps for GPT Models: A Detailed Overview
Data Preprocessing Challenges
Data Augmentation: Expand and Enhance Datasets
Image Data Augmentation: Increasing Visual Variability
Text Data Augmentation: Diversifying Linguistic Patterns
Benefits of Data Augmentation
How Data Augmentation is Used in the Real World
Technologies that Make Generative AI Possible
Scalable Data Infrastructure: Storing Large and Very Large Datasets
Data Pipelines: Ensuring Seamless Data Flow
Computing Needs and Stress on High-Performance Hardware
Parallel Data Processing and Model Training with Distributed Frameworks
AI as a Managed Service: Build on Top of Cloud AI Solutions
Challenges in Data Management
Volume and Storage
Data Quality and Biases
Privacy and Security
Legal and Ethical Considerations: Why Responsible Data Use Is Necessary
The Future of Data in Generative AI
Synthetic Data Generation
Multimodal Data Fusion
Edge Computing
Self-Supervised Learning
Federated Learning
Summary
References
Big Data Platforms for AI
Additional Tools and Integrations
Distributed Systems and Their Role in Scaling AI
Core Principles of Distributed Systems
Horizontal Scaling
Fault Tolerance
Data Localization
Key Technologies
Data Warehouses, Data Lakes, Data Lakehouses, and Cloud Storage
Data Warehouses
Data Lakes
Data Lakehouses
Cloud Storage
The Evolution of Data Storage Technologies
NoSQL Databases
Traditional RDBMS versus Modern Solutions
Cloud Databases
Choosing the Right Database for AI
Data Type
Performance Needs
Integration
Cost
Optimizing Data Access for AI Applications
Real-Time Streaming
Caching Layers
Future Trends and Emerging Concepts
Edge Storage and Processing
AI-Optimized Storage Systems
Data Governance
Summary
References
Core Types of Data
The Role of Data Quality and Master Data Management (MDM) in AI
Advanced Data Quality (DQ) Techniques
Data Profiling: Analyzing Data for Insights
Data Cleansing and Standardization: Ensuring Uniformity
Data Deduplication: Removing Redundancies
Data Lineage and Provenance: Tracing Data Journeys
Data Imputation: Addressing Missing Data
Data Consistency Validation
Data Enrichment: Adding Context
Anomaly Detection: Identifying Irregularities
Schema Drift Detection
Data Governance and Access Controls
Data Masking and Synthetic Data Generation
Bias Detection and Mitigation
Data Integrity Verification
Golden Record Creation
Business Scenarios for Data Quality Tools
Fraud Detection in Financial Institutions
Predictive Maintenance in Energy Companies
Healthcare Data Validation for Diagnostics
Data Governance for AI: Framework, Layers, and Benefits
Framework for Data Governance
Key Layers in Data Governance
Benefits of Effective Data Governance
Master Data Management (MDM) Technologies for AI
Key Business Scenarios for MDM in AI
MDM Platforms: Legacy to Cutting-Edge
Connecting MDM and DQ to Ethical Data Management
Summary
References
The Responsibility of Data in AI
Technologies for Data Privacy and Compliance
Data Minimization
Requirement Analysis
Data Mapping
Key Steps in Data Mapping
Benefits
Tools and Technologies
Best Practice
Schema Design
Data Collection Controls
Regular Audits and Monitoring
Financial Impact
Use Case
Consent Management
Why Consent Management Matters
Dynamic Consent Interfaces
Consent Storage
Enforcement Mechanisms
Audit Trails
Consent Validation API
Financial Impact
Data Anonymization
How to Implement Data Anonymization
Financial Impact
Use Case
Right to be Forgotten
Tracking and Mapping
Automated Deletion Workflows
Audit and Verification Mechanisms
Financial Impact
Use Case
Ethical Frameworks and Best Practices
Governance in High-Stakes Scenarios
Healthcare
Law Enforcement
Finance
Transportation
Education
Retail
Technology
Role of Cross-Functional Teams in Governance
Key Stakeholders and Their Roles
Data Analysts
Data Stewards
Data Scientists
Legal Teams
Governance Committees
IT Teams
Business Executives
Collaboration Tools for Cross-Functional Teams
Case Studies: Real-World Applications of Ethical AI Governance
Summary
References
Historical Evolution of Data Movement Technologies
From Manual Data Integration to Automated Pipelines
Emergence of Real-Time Data Streaming
1970s: Manual Data Integration
1980s: Rise of ETL (Extract, Transform, Load)
1990s: Commercial ETL Tools
2000s: Batch Processing and Open-Source ETL
2010s: Real-Time Data Streaming
2020s: Modern Real-Time Data Ecosystems
ETL and Real-Time Data Streaming Technologies
The Role of ETL in Data Preparation
Real-Time Data Streaming Technologies
Unified Data Pipelines
Industry-Specific Data Movement Use Cases
The Role of Metadata in Data Movement
Metadata Management
Tools for Metadata Handling
Data Movement in Multi-Cloud and Hybrid Environments
Tools for Multi-Cloud Orchestration
Data Governance and Compliance in Data Movement
Adhering to Privacy Regulations
Role of Data Observability
Advanced Techniques in Real-Time Data Processing
Windowed Operations and Aggregations
Stateful Stream Processing
Event Sourcing
Emerging Technologies and Trends
Decentralized Data Pipelines
Quantum Data Movement
Self-Healing Data Pipelines
Comparison of Data Movement Frameworks
Key Feature Matrix
Decision-Making Framework
Performance Optimization Strategies
Best Practices for Building Resilient Data Pipelines
Challenges in Data Movement
Summary
References
Monitoring AI Models with Real-Time Dashboards
Key Features of Real-Time Dashboards for AI Monitoring
Key Technologies for AI Model Monitoring
Best Practices for AI Model Dashboards
The Role of Visual Business Intelligence in AI Strategies
How BI Enhances AI Decision-Making
Fraud Detection
Customer Retention
Healthcare Analytics
Supply Chain Optimization
Marketing and Sales Analytics
Workforce Analytics
Manufacturing Process Optimization
Financial Forecasting
Real-Time Data Visualization for AI Model Monitoring
Integrating BI with AI Workflows
Real-Time Data Delivery to Business Applications
Importance of Real-Time Data Delivery
APIs (Application Programming Interfaces)
Robotic Process Automation (RPA)
Real-Time Dashboards
Benefits of Real-Time Data Delivery
Technologies Enabling AI Operationalization
AI Decision Workflow in Business Operations
Data Ingestion and Preprocessing
AI Model Inference and Decision Engine
API-Based Integration with Business Applications
Automated Workflows and RPA Execution
Continuous Monitoring and Human-in-the-Loop Feedback
AI Feedback Loop and Model Retraining
Example Use Cases of Real-Time AI Implementation
AI-Powered Chatbots
Automated Risk Assessment
AI-Optimized Inventory Management
AI-Enabled Personalized Marketing
AI-Powered Predictive Maintenance
AI-Augmented Cybersecurity
Real-Time Data Visualization for AI Model Monitoring
Case Study: AI in Financial Services
Situation
Task
Action
Result
Conclusion
Summary
References
Technological Solutions for Addressing AI Model Failures
Human-Centered AI and Explainable AI in Practice
AI-Assisted Decision-Making: A Symbiotic Approach
Key Strategies for Effective AI-Human Collaboration
Regulatory Compliance for AI Transparency: Governance and Ethical AI Practices
Core Elements of AI Transparency Compliance
Ethical AI Development Frameworks: Principles for Responsible AI
Key Ethical AI Frameworks
Best Practices for Ethical AI Implementation
Real-World Application of Ethical AI Frameworks
AI for Sustainability and Social Good
Key Characteristics of Classical AI
AI-Augmented Workforce and The Future of Work
The Ethics of AI Self-Improvement and Decision Autonomy
Ensuring Data Integrity and Ethical Use of AI
Data Quality Tools (Previously discussed in Chapter 8: MDM and Data Quality for AI)
Monitoring and Observability Technologies
Addressing AI Bias and Overfitting
Privacy-Preserving AI
Transparency and Explainability
Emerging Trends in AI Data Technologies
Real-Time Data Processing and Edge AI
Self-Healing AI Pipelines
AI-Generated Synthetic Data for Model Training
Automated AI Governance and Compliance
The Future of AI Infrastructure: Quantum AI and Beyond
Quantum Machine Learning (QML)
Neuromorphic Computing and Brain-Inspired AI
Decentralized AI and Blockchain Integration
Challenges and Future Prospects
Bio-Inspired AI Models
Applications of Bio-Inspired AI
The Future of Bio-Inspired AI
Summary
References
AI has become a transformative force across industries, from healthcare and finance to retail and manufacturing. However, while much attention is given to AI models and algorithms, the data that feeds these systems is often overlooked. This book shifts the focus to the foundational elements of AI—data architecture, storage, processing, and governance—so that organizations can effectively harness the potential of AI. Without high-quality, well-structured data, even the most advanced AI models cannot deliver reliable results.
In the first part of this book, we explore the evolution of AI and its reliance on data. We begin with an overview of AI’s history, including data mining’s role in early machine learning. From there, we examine the challenges of managing machine learning data, the infrastructure required for deep learning, and the unique data needs of large language models such as ChatGPT. The book also delves into generative AI, which requires vast datasets and specialized storage and processing solutions.
The second part of this book moves from theory to practice, detailing how organizations can operationalize data for AI. This includes modern storage solutions, master data management (MDM), data quality, governance, and the ethical considerations surrounding AI-driven decision-making. We explore real-time data pipelines, how data moves within AI-powered organizations, and the technical and business processes required to make AI truly operational. Additionally, we discuss common pitfalls and provide insights into the future of AI data infrastructure.
Whether you are a data professional, AI practitioner, or business leader, this book provides the knowledge necessary to navigate the complex world of AI data. By mastering data infrastructure, you will be better equipped to build, deploy, and scale AI systems that drive meaningful impact.
We are in the midst of the rise and evolution of Generative AI. Foundational AI continues to deliver reliable and valuable insights, and Causal AI is on the horizon. What do these three pillars of AI have in common and require? Data, and an immense amount of it, and not just a one-time infusion, but an ongoing flood of data to keep all these models working at the optimal level. I am so pleased that Scott Burk and Kinshuk Dutta took on the challenge of writing a book that dives into how to obtain, clean, integrate, and use data in this rapidly evolving landscape of AI. Having worked with Scott, I know first-hand his abilities and capabilities in working with all types of data. If you are interested in learning world-class best practices of preparing data for use in your AI environment, this book will be an invaluable resource for your journey into the multifaceted world of data for AI.
John Thompson
Author, Innovator, Adjunct Professor, University of Michigan, School of Information
Dr. Scott Burk is the founder of It’s All Analytics (itsallanalytics.com), where he advises companies on creating their optimal data, AI, and analytics architecture to maximize their objectives. He stays current by consulting, writing, and teaching. He is the author of seven books on AI, data science, and analytics, including the It’s All Analytics Series. He currently teaches in the MS in Data Science program at CUNY and has taught at Baylor and Texas A&M. He has developed curricula for several universities including SMU. His experience is in solving difficult AI, statistical, and analytical problems at companies such as Dell, Texas Instruments, PayPal, eBay, Overstock.com, healthcare companies, energy companies, semiconductor and other manufacturing companies, startups, and many others across the globe. Scott has a bachelor’s degree in biology and chemistry, master’s degrees in finance, statistics, and data mining, and a PhD in statistics. Data has been the thread that has tied his professional experience together. Scott resides in Central Texas.
Kinshuk Dutta is a visionary technology leader with over 18 years of experience in Data Management, Business Integration, and Autonomous Endpoint Management. Currently Director of Product Enablement at Tanium Inc., Kinshuk has a strong record of leading global teams in Pre Sales, Customer Success, Product R&D, and Product Enablement. His career has been rooted in Data and AI, specializing in delivering sophisticated solutions to solve complex enterprise problems, accelerating sales cycles, and driving customer adoption.