Data for AI

Data for AI: Data Infrastructure for Machine Intelligence, by Scott Burk and Kinshuk Dutta

Artificial Intelligence (AI) is only as powerful as the data that fuels it, and this book is your comprehensive guide to understanding the critical data infrastructure that makes AI work.

Topics

Part I: AI Evolution and Data Overview

Chapter 1: Introduction to Data for AI

Three Pillars of Accelerated AI Adoption

Focus on Data

Chapter 2: Data Mining in AI

Modeling Paradigms for DM

Business Understanding

Data Understanding

Data Preparation

Modeling

Evaluation Phase

Deployment Phase

Data Mining and Supporting Data Technology

Data Preprocessing

Data Catalog Platforms

Data Technology for Visualization, BI, and Exploratory Data Analysis

Data Technology for Dimensionality Reduction

Data Technology Supporting Knowledge Graphs

Data Virtualization for Data Mining

Application Case Highlight – Data Mining to Support Accountable Healthcare

Summary

References

Chapter 3: Data Challenges in Machine Learning

Getting Machine Learning Infrastructure Right

Technical Debt in Machine Learning

Model Dependencies

Data Dependencies

Feedback

Major Types of Machine Learning and Data to Support Them

Supervised Learning

Predicting Loan Default

Unsupervised Learning

Claim Clustering for Property and Casualty Insurance

Semi-Supervised Learning

Property and Casualty Insurance Example with Self-Training

Property and Casualty Insurance Example with Co-Training

Reinforcement Learning

Inventory Management Example

Ensemble Learning

Bagging (Bootstrap Aggregating)

Boosting Methods

Property and Casualty Insurance Example with Boosting

Section Summary

The Rise of Big Data and a Technology Paradigm Shift

Challenges of Traditional ML Architecture

Application Case Highlight – The Next Generation of Medicine – Predicting Asthma

Data Technology was the Barrier for Years

Medical Science and Acquiring the Right Data

Machine Learning for AI Asthma Models

Model Results and Applications

Multiple Applications

Case Management Application

SMS Text Alerts

Emergency Management Reports

Summary

References

Chapter 4: Deep Learning and Data Infrastructure

The Role of Apache Spark in Enabling Deep Learning

In-Memory Computing and Distributed Computing

Integration with DL Libraries and Real-Time Data Processing

Ease of Use and Community

The Role of Data Lakes in Enabling Deep Learning (2013-2017)

Key Benefits of Data Lakes

Key Data Lake Technologies

Additional Cloud-Based Data Platforms (2013-2017)

Key Cloud-Based Data Platforms

Data Pipelines (2013-2017)

Data Processing Frameworks (2013-2017)

Interoperability in Data Processing and Deep Learning (2013-2017)

Benefits of Interoperability

Impact on Deep Learning

Feature Engineering Platforms (2013-2017)

Key Feature Engineering Platforms

Impact on Deep Learning

ML Lifecycle Platforms (2013-2017)

Key ML Lifecycle Platforms

Impact on Deep Learning

Chapter 5: ChatGPT and Large Language Models

The Evolution of Large Language Models (Post-2017)

The Transition from DL to NLP Breakthroughs

Data Flow in Conversational AI: A Deep Dive

How ChatGPT Processes and Generates Text

APIs and Real-Time Processing in AI Chatbots

The Role of Cloud Computing in Scaling LLMs

Section Summary

Fine-Tuning and Customization of LLMs

Infrastructure Required for Customization

Challenges in Implementing LLMs

Ethical Concerns and AI Bias

Computational Costs and Energy Consumption

Security and Data Privacy in LLM Applications

Future Trends in Conversational AI

More Efficient, Smaller LLMs

AI-Powered Search Engines and Hybrid AI Assistants

Expansion into Multimodal AI (Text + Image + Video)

Summary

References

Chapter 6: Data in Generative AI

Generative AI: A Brief Overview

Data: The Lifeblood of Generative AI

Enhancing Large Language Models with Taxonomies and AI-Generated Labels

How Taxonomies Improve LLM Training

AI-Generated Labels for Adaptive Learning

The Future of Structured AI Training

Continuing with Text Datasets in Generative AI

Image Datasets: The Backbone of Visual Generative AI

Annotations: The Key to Image Dataset Utility

Curating and Preprocessing Image Datasets

Challenges in Image Dataset Design

Applications of Image Datasets in Generative AI

Multimodal Datasets: Bridging Modalities for Cross-Functional AI

What Are Multimodal Datasets?

Components of Multimodal Datasets

Curating and Preprocessing Multimodal Datasets

Applications of Multimodal Datasets

Challenges in Multimodal Dataset Development

Case Study – GPT-3’s Training Data

Data Preprocessing: Preparing Data for AI

Deduplication: Data Uniqueness and Diversity

Normalization: Achieving Standardization among Different Data Formats

Annotation — Adding Contextual Metadata to Data

Preprocessing Steps for GPT Models: A Detailed Overview

Data Preprocessing Challenges

Data Augmentation: Expand and Enhance Datasets

Image Data Augmentation: Increasing Visual Variability

Text Data Augmentation: Diversifying Linguistic Patterns

Benefits of Data Augmentation

How Data Augmentation is Used in the Real World

Technologies that Make Generative AI Possible

Scalable Data Infrastructure: Storing Large and Very Large Datasets

Data Pipelines: Ensuring Seamless Data Flow

Computing Needs and Stress on High-Performance Hardware

Parallel Data Processing and Model Training with Distributed Frameworks

AI as a Managed Service: Build on Top of Cloud AI Solutions

Challenges in Data Management

Volume and Storage

Data Quality and Biases

Privacy and Security

Legal and Ethical Considerations: Why Responsible Data Use Is Necessary

The Future of Data in Generative AI

Synthetic Data Generation

Multimodal Data Fusion

Edge Computing

Self-Supervised Learning

Federated Learning

Summary

References

Part II: Operationalizing Data for AI

Chapter 7: Modern Data Storage and Processing for AI

Big Data Platforms for AI

Additional Tools and Integrations

Distributed Systems and Their Role in Scaling AI

Core Principles of Distributed Systems

Horizontal Scaling

Fault Tolerance

Data Localization

Key Technologies

Data Warehouses, Data Lakes, Data Lakehouses, and Cloud Storage

Data Warehouses

Data Lakes

Data Lakehouses

Cloud Storage

The Evolution of Data Storage Technologies

NoSQL Databases

Traditional RDBMS versus Modern Solutions

Cloud Databases

Choosing the Right Database for AI

Data Type

Performance Needs

Integration

Cost

Optimizing Data Access for AI Applications

Real-Time Streaming

Caching Layers

Future Trends and Emerging Concepts

Edge Storage and Processing

AI-Optimized Storage Systems

Data Governance

Summary

References

Chapter 8: Master Data Management (MDM) and Data Quality for AI

Core Types of Data

The Role of Data Quality and Master Data Management (MDM) in AI

Advanced Data Quality (DQ) Techniques

Data Profiling: Analyzing Data for Insights

Data Cleansing and Standardization: Ensuring Uniformity

Data Deduplication: Removing Redundancies

Data Lineage and Provenance: Tracing Data Journeys

Data Imputation: Addressing Missing Data

Data Consistency Validation

Data Enrichment: Adding Context

Anomaly Detection: Identifying Irregularities

Schema Drift Detection

Data Governance and Access Controls

Data Masking and Synthetic Data Generation

Bias Detection and Mitigation

Data Integrity Verification

Golden Record Creation

Business Scenarios for Data Quality Tools

Fraud Detection in Financial Institutions

Predictive Maintenance in Energy Companies

Healthcare Data Validation for Diagnostics

Data Governance for AI: Framework, Layers, and Benefits

Framework for Data Governance

Key Layers in Data Governance

Benefits of Effective Data Governance

Master Data Management (MDM) Technologies for AI

Key Business Scenarios for MDM in AI

MDM Platforms: Legacy to Cutting-Edge

Connecting MDM and DQ to Ethical Data Management

Summary

References

Chapter 9: Ethical Data Management and Governance for AI

The Responsibility of Data in AI

Technologies for Data Privacy and Compliance

Data Minimization

Requirement Analysis

Data Mapping

Key Steps in Data Mapping

Benefits

Tools and Technologies

Best Practice

Schema Design

Data Collection Controls

Regular Audits and Monitoring

Financial Impact

Use Case

Consent Management

Why Consent Management Matters

Dynamic Consent Interfaces

Consent Storage

Enforcement Mechanisms

Audit Trails

Consent Validation API

Financial Impact

Data Anonymization

How to Implement Data Anonymization

Financial Impact

Use Case

Right to be Forgotten

Tracking and Mapping

Automated Deletion Workflows

Audit and Verification Mechanisms

Financial Impact

Use Case

Ethical Frameworks and Best Practices

Governance in High-Stakes Scenarios

Healthcare

Law Enforcement

Finance

Transportation

Education

Retail

Technology

Role of Cross-Functional Teams in Governance

Key Stakeholders and Their Roles

Data Analysts

Data Stewards

Data Scientists

Legal Teams

Governance Committees

IT Teams

Business Executives

Collaboration Tools for Cross-Functional Teams

Case Studies: Real-World Applications of Ethical AI Governance

Summary

References

Chapter 10: How Data Moves in AI-Powered Organizations

Historical Evolution of Data Movement Technologies

From Manual Data Integration to Automated Pipelines

Emergence of Real-Time Data Streaming

1970s: Manual Data Integration

1980s: Rise of ETL (Extract, Transform, Load)

1990s: Commercial ETL Tools

2000s: Batch Processing and Open-Source ETL

2010s: Real-Time Data Streaming

2020s: Modern Real-Time Data Ecosystems

ETL and Real-Time Data Streaming Technologies

The Role of ETL in Data Preparation

Real-Time Data Streaming Technologies

Unified Data Pipelines

Industry-Specific Data Movement Use Cases

The Role of Metadata in Data Movement

Metadata Management

Tools for Metadata Handling

Data Movement in Multi-Cloud and Hybrid Environments

Tools for Multi-Cloud Orchestration

Data Governance and Compliance in Data Movement

Adhering to Privacy Regulations

Role of Data Observability

Advanced Techniques in Real-Time Data Processing

Windowed Operations and Aggregations

Stateful Stream Processing

Event Sourcing

Emerging Technologies and Trends

Decentralized Data Pipelines

Quantum Data Movement

Self-Healing Data Pipelines

Comparison of Data Movement Frameworks

Key Feature Matrix

Decision-Making Framework

Performance Optimization Strategies

Best Practices for Building Resilient Data Pipelines

Challenges in Data Movement

Summary

References

Chapter 11: Making AI Operational

Monitoring AI Models with Real-Time Dashboards

Key Features of Real-Time Dashboards for AI Monitoring

Key Technologies for AI Model Monitoring

Best Practices for AI Model Dashboards

The Role of Visual Business Intelligence in AI Strategies

How BI Enhances AI Decision-Making

Fraud Detection

Customer Retention

Healthcare Analytics

Supply Chain Optimization

Marketing and Sales Analytics

Workforce Analytics

Manufacturing Process Optimization

Financial Forecasting

Real-Time Data Visualization for AI Model Monitoring

Integrating BI with AI Workflows

Real-Time Data Delivery to Business Applications

Importance of Real-Time Data Delivery

APIs (Application Programming Interfaces)

Robotic Process Automation (RPA)

Real-Time Dashboards

Benefits of Real-Time Data Delivery

Technologies Enabling AI Operationalization

AI Decision Workflow in Business Operations

Data Ingestion and Preprocessing

AI Model Inference and Decision Engine

API-Based Integration with Business Applications

Automated Workflows and RPA Execution

Continuous Monitoring and Human-in-the-Loop Feedback

AI Feedback Loop and Model Retraining

Example Use Cases of Real-Time AI Implementation

AI-Powered Chatbots

Automated Risk Assessment

AI-Optimized Inventory Management

AI-Enabled Personalized Marketing

AI-Powered Predictive Maintenance

AI-Augmented Cybersecurity

Real-Time Data Visualization for AI Model Monitoring

Case Study: AI in Financial Services

Situation

Task

Action

Result

Conclusion

Summary

References

Chapter 12: Avoiding Common Pitfalls and the Future of AI

Technological Solutions for Addressing AI Model Failures

Human-Centered AI and Explainable AI in Practice

AI-Assisted Decision-Making: A Symbiotic Approach

Key Strategies for Effective AI-Human Collaboration

Regulatory Compliance for AI Transparency: Governance and Ethical AI Practices

Core Elements of AI Transparency Compliance

Ethical AI Development Frameworks: Principles for Responsible AI

Key Ethical AI Frameworks

Best Practices for Ethical AI Implementation

Real-World Application of Ethical AI Frameworks

AI for Sustainability and Social Good

Key Characteristics of Classical AI

AI-Augmented Workforce and The Future of Work

The Ethics of AI Self-Improvement and Decision Autonomy

Ensuring Data Integrity and Ethical Use of AI

Data Quality Tools (Previously discussed in Chapter 8: MDM and Data Quality for AI)

Monitoring and Observability Technologies

Addressing AI Bias and Overfitting

Privacy-Preserving AI

Transparency and Explainability

Emerging Trends in AI Data Technologies

Real-Time Data Processing and Edge AI

Self-Healing AI Pipelines

AI-Generated Synthetic Data for Model Training

Automated AI Governance and Compliance

The Future of AI Infrastructure: Quantum AI and Beyond

Quantum Machine Learning (QML)

Neuromorphic Computing and Brain-Inspired AI

Decentralized AI and Blockchain Integration

Challenges and Future Prospects

Bio-Inspired AI Models

Applications of Bio-Inspired AI

The Future of Bio-Inspired AI

Summary

References

AI has become a transformative force across industries, from healthcare and finance to retail and manufacturing. However, while much attention is given to AI models and algorithms, the data that feeds these systems is often overlooked. This book shifts the focus to the foundational elements of AI—data architecture, storage, processing, and governance—so that organizations can effectively harness the potential of AI. Without high-quality, well-structured data, even the most advanced AI models cannot deliver reliable results.

In the first part of this book, we explore the evolution of AI and its reliance on data. We begin with an overview of AI’s history, including data mining’s role in early machine learning. From there, we examine the challenges of managing machine learning data, the infrastructure required for deep learning, and the unique data needs of large language models such as ChatGPT. The book also delves into generative AI, which requires vast datasets and specialized storage and processing solutions.

The second part of this book moves from theory to practice, detailing how organizations can operationalize data for AI. This includes modern storage solutions, master data management (MDM), data quality, governance, and the ethical considerations surrounding AI-driven decision-making. We explore real-time data pipelines, how data moves within AI-powered organizations, and the technical and business processes required to make AI truly operational. Additionally, we discuss common pitfalls and provide insights into the future of AI data infrastructure.

Whether you are a data professional, AI practitioner, or business leader, this book provides the knowledge necessary to navigate the complex world of AI data. By mastering data infrastructure, you will be better equipped to build, deploy, and scale AI systems that drive meaningful impact.

We are in the midst of the rise and evolution of Generative AI. Foundational AI continues to deliver reliable and valuable insights and Causal AI is on the horizon. What do these three pillars of AI have in common and require? Data, and an immense amount of it, and not just a one-time infusion, but an ongoing flood of data to keep all these models working at the optimal level. I am so pleased that Scott Burk and Kinshuk Dutta took on the challenge of writing a book that dives into how to obtain, clean, integrate, and use data in this rapidly evolving landscape of AI. Having worked with Scott, I know first-hand his abilities and capabilities in working with all types of data. If you are interested in learning world-class best practices of preparing data for use in your AI environment, this book will be an invaluable resource for your journey into the multifaceted world of data for AI.

John Thompson
Author, Innovator, Adjunct Professor, University of Michigan, School of Information

About Scott and Kinshuk

Dr. Scott Burk is the founder of It’s All Analytics (itsallanalytics.com), where he advises companies on creating their optimal data, AI, and analytics architecture to maximize their objectives. He stays current by consulting, writing, and teaching. He is the author of seven books on AI, data science, and analytics, including the It’s All Analytics Series. He currently teaches in the MS in Data Science program at CUNY and has taught at Baylor and Texas A&M. He has developed curricula for several universities including SMU. His experience is in solving difficult AI, statistical, and analytical problems at companies such as Dell, Texas Instruments, PayPal, eBay, Overstock.com, healthcare companies, energy companies, semiconductor and other manufacturing companies, startups, and many others across the globe. Scott has a bachelor’s degree in biology and chemistry, master’s degrees in finance, statistics, and data mining, and a PhD in statistics. Data has been the thread that has tied his professional experience together. Scott resides in Central Texas.

Kinshuk Dutta is a visionary technology leader with over 18 years of experience in Data Management, Business Integration, and Autonomous Endpoint Management. Currently Director of Product Enablement at Tanium Inc., Kinshuk has a strong record of leading global teams in Pre Sales, Customer Success, Product R&D, and Product Enablement. His career has been rooted in Data and AI, specializing in delivering sophisticated solutions to solve complex enterprise problems, accelerating sales cycles, and driving customer adoption.
