
Alain Airom

Harnessing the Power of watsonx.data: An Elegant Approach by Bob

Bridging the Gap: Crafting a Seamless Interface for watsonx.data with Bob!

What is watsonx.data?

IBM watsonx.data is a next-generation, open architecture lakehouse designed to combine the flexibility of data lakes with the performance and structure of data warehouses. It serves as a unified data platform that allows organizations to collect, store, and analyze structured, semi-structured, and unstructured data for AI and Business Intelligence (BI) workloads. By enabling the attachment of existing data sources, it helps reduce data duplication and storage costs across hybrid-cloud and multicloud environments.

Key Components and Capabilities

  • Open Data Architecture: It utilizes an architecture that fully separates compute, metadata, and storage, allowing different engines to access and share data simultaneously through open formats like Apache Iceberg.
  • Multiple Query Engines: The platform provides fast and efficient processing at scale using multiple engines, including Presto (Java), Presto (C++), and Spark; a minimal query sketch follows this list.
  • Built-in Governance: watsonx.data enforces schema and data integrity with integrated governance and security mechanisms, compatible with solutions like IBM Knowledge Catalog.
  • Hybrid & Multicloud Flexibility: It offers cost-effective object storage and integrates with a robust ecosystem of third-party services and IBM solutions like Db2 Warehouse and Netezza Performance Server.
  • AI-Ready Tools: It includes specialized services like Milvus, a vector database essential for managing embedding vectors used in AI applications and similarity searches.
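
To make the multi-engine idea concrete, here is a minimal sketch of submitting a SQL statement to a Presto engine over Presto's standard /v1/statement client protocol (submit, then follow nextUri until the query finishes). This is not code from the demo repository: the coordinator URL, user, catalog, and schema names are illustrative assumptions, and in a watsonx.data deployment you would point it at the engine's exposed endpoint and add the appropriate authentication.

// Hedged sketch: run one SQL statement against a Presto engine using the
// /v1/statement client protocol. URL, user, catalog, and schema are assumptions.
const axios = require('axios');

const PRESTO_URL = process.env.PRESTO_URL || 'http://localhost:8080'; // assumed coordinator
const headers = {
  'X-Presto-User': 'demo-user',        // assumed user
  'X-Presto-Catalog': 'iceberg_data',  // assumed catalog name
  'X-Presto-Schema': 'demo_schema',    // assumed schema name
};

async function runQuery(sql) {
  // Initial submission returns partial state plus a nextUri to poll.
  let { data } = await axios.post(`${PRESTO_URL}/v1/statement`, sql, { headers });

  const rows = [];
  let columns;

  while (true) {
    if (data.columns) columns = data.columns;
    if (data.data) rows.push(...data.data);
    if (!data.nextUri) break;
    ({ data } = await axios.get(data.nextUri, { headers }));
  }

  if (data.error) throw new Error(data.error.message);
  return { columns, rows };
}

// Example: list the tables the engine can see in the assumed catalog/schema.
runQuery('SHOW TABLES')
  .then(({ rows }) => console.log(rows.flat()))
  .catch(console.error);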

watsonx.data Developer Edition

IBM watsonx.data Developer Edition is a specialized, self-contained environment designed for developers and data professionals to experiment with, evaluate, and develop proof-of-concept solutions.


Built on an open data lakehouse architecture, it allows users to quickly jumpstart development by providing pre-configured engines and integrated tools that eliminate the need for complex initial configurations. The latest release introduces a streamlined, user-friendly experience with a simplified installation process on macOS, Windows, and Linux, enabling users to efficiently provision and manage their data environments. Key features include a comprehensive user interface for infrastructure and data management, as well as support for multiple query engines like Presto and Spark to handle diverse AI and analytics workloads.


Introducing the application: watsonx.data demonstration with Bob

While the native interfaces of IBM watsonx.data offer comprehensive power for data manipulation and ingestion, the true potential of a data lakehouse is often realized when integrated directly into the custom, ad-hoc applications that drive daily business decisions. This is where the “wxData-Bob” application bridges the gap, transforming complex backend capabilities into an intuitive, user-centric experience. By leveraging Bob’s application, organizations can seamlessly harness watsonx.data’s robust engine through a third-party interface that features automated data ingestion, real-time system monitoring, and an interactive SQL workspace. This implementation not only showcases the flexibility of the watsonx.data API but also empowers teams to interact with their data through a streamlined, elegant dashboard designed for modern productivity.

The watsonx.data Developer Edition Demo Application (also known as wxData-Bob) is a comprehensive demonstration tool designed to showcase the integration and power of IBM watsonx.data through a custom third-party interface. Built with a React frontend and a Node.js/Express backend, the application serves as a bridge between the robust data lakehouse capabilities of watsonx.data and an intuitive, user-centric web dashboard. It is specifically tailored for developers to experiment with and evaluate how to automate data operations — such as ingestion, catalog management, and complex SQL querying — outside of the native IBM user interface.


The application provides a centralized platform to manage the lifecycle of data within a lakehouse environment:

  • Unified Ingestion & Storage: Enables browser-based file uploads directly to MinIO/S3 and supports automated ingestion for various formats like JSON, CSV, and Parquet.
  • Comprehensive Catalog Management: Offers full CRUD operations for Iceberg and Hive catalogs, including schema visualization and table metadata viewing.
  • Interactive Query Workspace: Features a SQL editor with syntax highlighting and result visualization, allowing users to execute Presto or Spark queries against the lakehouse.
  • Real-time Monitoring: Includes a dedicated dashboard for tracking system metrics, component health, and performance analytics with live updates.
  • Secure API Integration: Utilizes bearer token-based authentication and secure credential management to interact with the watsonx.data REST APIs; a minimal token sketch follows this list.
  • Flexible Deployment: Supports containerized environments via Docker Compose and Kubernetes manifests for both local development and scaled production testing.
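
As a rough illustration of the authentication pattern, the sketch below obtains a bearer token and attaches it, together with the instance ID header, to a watsonx.data REST call, mirroring what the backend service code later in this post does. The authentication route, API path, and response field names are assumptions read from environment variables, not documented values; consult the watsonx.data documentation for the actual endpoints.

// Hedged sketch: obtain a bearer token and attach it (plus the instance ID header)
// to a watsonx.data REST call. Auth route and API path are placeholders supplied
// via environment variables; set them from the product documentation.
const axios = require('axios');
const https = require('https');

const agent = new https.Agent({ rejectUnauthorized: false }); // dev-only: self-signed certs

const BASE_URL = process.env.WXD_BASE_URL || 'https://localhost:6443'; // from the demo's system info
const AUTH_URL = process.env.WXD_AUTH_URL;        // set to the documented authentication route
const INSTANCE_ID = process.env.WXD_INSTANCE_ID;  // your instance ID

async function getToken() {
  const { data } = await axios.post(
    AUTH_URL,
    { username: process.env.WXD_USERNAME, password: process.env.WXD_PASSWORD },
    { httpsAgent: agent }
  );
  return data.token; // response field name assumed; adjust to the actual shape
}

async function callWatsonxData(path) {
  const token = await getToken();
  const { data } = await axios.get(`${BASE_URL}${path}`, {
    headers: {
      Accept: 'application/json',
      Authorization: `Bearer ${token}`,
      Authinstanceid: INSTANCE_ID, // header name mirrors the backend service shown later
    },
    httpsAgent: agent,
  });
  return data;
}

// Example call; the path is a placeholder, substitute a real watsonx.data API route.
callWatsonxData(process.env.WXD_API_PATH || '/').then(console.log).catch(console.error);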

The Application

Crafted with a modern, full React frontend and a robust Node.js/Express backend, this application provides a seamless bridge to watsonx.data capabilities. To ensure a comprehensive testing environment, Bob included a Python-based data generation script to prepopulate the system, though users can also leverage the intuitive browser-based UI to upload and ingest documents directly into MinIO or S3 storage. The architecture is built on a foundation of full REST/CRUD functionalities, empowering developers to perform complex catalog management, schema visualization, and interactive SQL query execution with ease. Designed for modern infrastructure, the project is fully containerized with all necessary Dockerfiles and Kubernetes manifests — including ConfigMaps, Secrets, and Ingress templates — enabling a smooth transition from local experimentation to professional cloud or cluster deployments.
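
For readers who would rather script the uploads than use the browser UI, here is a minimal sketch of pushing a local file into a MinIO bucket over the S3-compatible API. It is not taken from the repository: the endpoint, credentials, file path, and object key are assumptions matching typical local MinIO defaults, while the bucket name matches the default shown on the demo dashboard.

// Hedged sketch: upload a local file to MinIO through its S3-compatible API.
// Endpoint, credentials, and object key are assumptions (typical local defaults).
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const fs = require('fs');

const s3 = new S3Client({
  endpoint: process.env.MINIO_ENDPOINT || 'http://localhost:9000', // assumed MinIO endpoint
  region: 'us-east-1',              // MinIO ignores the region, but the SDK requires one
  forcePathStyle: true,             // path-style addressing, as MinIO expects
  credentials: {
    accessKeyId: process.env.MINIO_ACCESS_KEY || 'minioadmin',
    secretAccessKey: process.env.MINIO_SECRET_KEY || 'minioadmin',
  },
});

async function uploadFile(localPath, key) {
  await s3.send(new PutObjectCommand({
    Bucket: 'iceberg-bucket',       // default bucket from the demo's system info card
    Key: key,
    Body: fs.readFileSync(localPath),
  }));
  console.log(`Uploaded ${localPath} as s3://iceberg-bucket/${key}`);
}

uploadFile('./data/sample.csv', 'raw/sample.csv').catch(console.error);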

Key Application Features:

Intuitive Web Interface: A React-based dashboard featuring a SQL editor with syntax highlighting and real-time performance analytics.

/**
 * Dashboard Page
 * 
 * Main dashboard showing system overview and quick actions
 */

import React, { useEffect, useState } from 'react';
import {
  Box,
  Card,
  CardContent,
  Grid,
  Typography,
  Button,
  Chip,
} from '@mui/material';
import {
  CloudUpload,
  CheckCircle,
  Error,
  HourglassEmpty,
} from '@mui/icons-material';
import { useNavigate } from 'react-router-dom';
import axios from 'axios';

export default function Dashboard() {
  const navigate = useNavigate();
  const [authStatus, setAuthStatus] = useState(null);
  const [recentJobs, setRecentJobs] = useState([]);

  useEffect(() => {
    fetchAuthStatus();
    fetchRecentJobs();
  }, []);

  const fetchAuthStatus = async () => {
    try {
      const response = await axios.get('/api/auth/status');
      setAuthStatus(response.data.data);
    } catch (error) {
      console.error('Error fetching auth status:', error);
    }
  };

  const fetchRecentJobs = async () => {
    try {
      const response = await axios.get('/api/ingestion/jobs?limit=5');
      setRecentJobs(response.data.data || []);
    } catch (error) {
      console.error('Error fetching recent jobs:', error);
    }
  };

  const getStatusIcon = (status) => {
    switch (status) {
      case 'completed':
        return <CheckCircle color="success" />;
      case 'failed':
        return <Error color="error" />;
      case 'running':
      case 'starting':
        return <HourglassEmpty color="warning" />;
      default:
        return null;
    }
  };

  const getStatusColor = (status) => {
    switch (status) {
      case 'completed':
        return 'success';
      case 'failed':
        return 'error';
      case 'running':
      case 'starting':
        return 'warning';
      default:
        return 'default';
    }
  };

  return (
    <Box>
      <Typography variant="h4" gutterBottom>
        Dashboard
      </Typography>

      <Grid container spacing={3}>
        {/* Authentication Status Card */}
        <Grid item xs={12} md={6}>
          <Card>
            <CardContent>
              <Typography variant="h6" gutterBottom>
                Authentication Status
              </Typography>
              {authStatus ? (
                <Box>
                  <Chip
                    label={authStatus.hasToken && !authStatus.isExpired ? 'Connected' : 'Disconnected'}
                    color={authStatus.hasToken && !authStatus.isExpired ? 'success' : 'error'}
                    sx={{ mb: 2 }}
                  />
                  {authStatus.hasToken && !authStatus.isExpired && (
                    <Typography variant="body2" color="text.secondary">
                      Token expires in: {Math.floor(authStatus.expiresIn / 1000 / 60)} minutes
                    </Typography>
                  )}
                </Box>
              ) : (
                <Typography variant="body2" color="text.secondary">
                  Loading...
                </Typography>
              )}
            </CardContent>
          </Card>
        </Grid>

        {/* Quick Actions Card */}
        <Grid item xs={12} md={6}>
          <Card>
            <CardContent>
              <Typography variant="h6" gutterBottom>
                Quick Actions
              </Typography>
              <Box sx={{ display: 'flex', gap: 2, flexWrap: 'wrap' }}>
                <Button
                  variant="contained"
                  startIcon={<CloudUpload />}
                  onClick={() => navigate('/ingestion')}
                >
                  New Ingestion
                </Button>
                <Button
                  variant="outlined"
                  onClick={() => navigate('/jobs')}
                >
                  View Jobs
                </Button>
              </Box>
            </CardContent>
          </Card>
        </Grid>

        {/* Recent Jobs Card */}
        <Grid item xs={12}>
          <Card>
            <CardContent>
              <Typography variant="h6" gutterBottom>
                Recent Ingestion Jobs
              </Typography>
              {recentJobs.length > 0 ? (
                <Box>
                  {recentJobs.map((job) => (
                    <Box
                      key={job.job_id}
                      sx={{
                        display: 'flex',
                        alignItems: 'center',
                        justifyContent: 'space-between',
                        py: 1,
                        borderBottom: '1px solid',
                        borderColor: 'divider',
                        '&:last-child': { borderBottom: 'none' },
                      }}
                    >
                      <Box sx={{ display: 'flex', alignItems: 'center', gap: 2 }}>
                        {getStatusIcon(job.status)}
                        <Box>
                          <Typography variant="body1">{job.job_id}</Typography>
                          <Typography variant="body2" color="text.secondary">
                            {job.target_table || 'N/A'}
                          </Typography>
                        </Box>
                      </Box>
                      <Chip
                        label={job.status}
                        color={getStatusColor(job.status)}
                        size="small"
                      />
                    </Box>
                  ))}
                </Box>
              ) : (
                <Typography variant="body2" color="text.secondary">
                  No recent jobs
                </Typography>
              )}
            </CardContent>
          </Card>
        </Grid>

        {/* System Info Card */}
        <Grid item xs={12}>
          <Card>
            <CardContent>
              <Typography variant="h6" gutterBottom>
                System Information
              </Typography>
              <Grid container spacing={2}>
                <Grid item xs={12} sm={6} md={3}>
                  <Typography variant="body2" color="text.secondary">
                    watsonx.data URL
                  </Typography>
                  <Typography variant="body1">
                    https://localhost:6443
                  </Typography>
                </Grid>
                <Grid item xs={12} sm={6} md={3}>
                  <Typography variant="body2" color="text.secondary">
                    Default Engine
                  </Typography>
                  <Typography variant="body1">
                    spark158
                  </Typography>
                </Grid>
                <Grid item xs={12} sm={6} md={3}>
                  <Typography variant="body2" color="text.secondary">
                    Default Bucket
                  </Typography>
                  <Typography variant="body1">
                    iceberg-bucket
                  </Typography>
                </Grid>
                <Grid item xs={12} sm={6} md={3}>
                  <Typography variant="body2" color="text.secondary">
                    Bucket Type
                  </Typography>
                  <Typography variant="body1">
                    MinIO
                  </Typography>
                </Grid>
              </Grid>
            </CardContent>
          </Card>
        </Grid>
      </Grid>
    </Box>
  );
}

// Made with Bob
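
One small wiring detail worth calling out: the dashboard calls relative paths such as /api/auth/status, so in local development those requests have to be forwarded to the Node.js backend (port 3001 in the Kubernetes manifests later in this post). A common way to do that with a React dev server is a proxy file like the following; this is a hedged sketch assuming http-proxy-middleware and a Create React App style setup, not code from the repository.

// src/setupProxy.js -- hedged sketch, assuming a Create React App dev server
// and http-proxy-middleware; forwards the dashboard's relative /api calls
// to the Express backend assumed to run on port 3001.
const { createProxyMiddleware } = require('http-proxy-middleware');

module.exports = function (app) {
  app.use(
    '/api',
    createProxyMiddleware({
      target: 'http://localhost:3001', // backend port taken from the deployment manifests
      changeOrigin: true,
    })
  );
};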
  • Automated Ingestion: Support for multiple formats (JSON, CSV, Parquet) with automatic file type detection and batch ingestion services.
/**
 * Ingestion Service
 * 
 * Handles data ingestion operations with watsonx.data including:
 * - Creating ingestion jobs
 * - Monitoring job status
 * - Managing ingestion configurations
 */

const axios = require('axios');
const https = require('https');
const config = require('../config/watsonx.config');
const authService = require('./authService');
const logger = require('../utils/logger');

class IngestionService {
  constructor() {
    this.axiosInstance = axios.create({
      httpsAgent: new https.Agent({
        rejectUnauthorized: config.watsonxData.ssl.rejectUnauthorized
      }),
      timeout: config.watsonxData.timeout
    });
  }

  /**
   * Create a new ingestion job
   * @param {Object} jobConfig - Ingestion job configuration
   * @returns {Promise<Object>} Job creation response
   */
  async createIngestionJob(jobConfig) {
    try {
      const token = await authService.getToken();
      const url = `${config.watsonxData.baseUrl}${config.watsonxData.endpoints.ingestion}`;

      // Merge with default engine config
      const executeConfig = {
        ...config.watsonxData.defaultEngine,
        ...jobConfig.execute_config
      };

      const payload = {
        target: jobConfig.target,
        source: jobConfig.source,
        job_id: jobConfig.job_id || `ingestion-${Date.now()}`,
        engine_id: jobConfig.engine_id || config.watsonxData.defaultEngine.engineId,
        execute_config: {
          driver_memory: executeConfig.driverMemory,
          driver_cores: executeConfig.driverCores,
          executor_memory: executeConfig.executorMemory,
          executor_cores: executeConfig.executorCores,
          num_executors: executeConfig.numExecutors
        }
      };

      logger.info('Creating ingestion job', { jobId: payload.job_id });

      const response = await this.axiosInstance.post(url, payload, {
        headers: {
          'Accept': 'application/json',
          'Content-Type': 'application/json',
          'Authorization': `Bearer ${token}`,
          'Authinstanceid': config.watsonxData.instanceId
        }
      });

      logger.info('Ingestion job created successfully', {
        jobId: response.data.job_id,
        applicationId: response.data.application_id,
        status: response.data.status
      });

      return response.data;
    } catch (error) {
      logger.error('Failed to create ingestion job', {
        error: error.message,
        response: error.response?.data
      });
      throw new Error(`Ingestion job creation failed: ${error.message}`);
    }
  }

  /**
   * Get ingestion job status
   * @param {string} jobId - Job ID
   * @returns {Promise<Object>} Job status
   */
  async getJobStatus(jobId) {
    try {
      const token = await authService.getToken();
      const url = `${config.watsonxData.baseUrl}${config.watsonxData.endpoints.ingestion}/${jobId}`;

      logger.info('Fetching job status', { jobId });

      const response = await this.axiosInstance.get(url, {
        headers: {
          'Accept': 'application/json',
          'Authorization': `Bearer ${token}`,
          'Authinstanceid': config.watsonxData.instanceId
        }
      });

      return response.data;
    } catch (error) {
      logger.error('Failed to fetch job status', {
        jobId,
        error: error.message
      });
      throw new Error(`Failed to get job status: ${error.message}`);
    }
  }

  /**
   * List all ingestion jobs
   * @param {Object} filters - Optional filters
   * @returns {Promise<Array>} List of jobs
   */
  async listJobs(filters = {}) {
    try {
      const token = await authService.getToken();
      const url = `${config.watsonxData.baseUrl}${config.watsonxData.endpoints.ingestion}`;

      logger.info('Listing ingestion jobs', { filters });

      const response = await this.axiosInstance.get(url, {
        headers: {
          'Accept': 'application/json',
          'Authorization': `Bearer ${token}`,
          'Authinstanceid': config.watsonxData.instanceId
        },
        params: filters
      });

      return response.data;
    } catch (error) {
      logger.error('Failed to list jobs', {
        error: error.message
      });
      throw new Error(`Failed to list jobs: ${error.message}`);
    }
  }

  /**
   * Cancel an ingestion job
   * @param {string} jobId - Job ID
   * @returns {Promise<Object>} Cancellation response
   */
  async cancelJob(jobId) {
    try {
      const token = await authService.getToken();
      const url = `${config.watsonxData.baseUrl}${config.watsonxData.endpoints.ingestion}/${jobId}`;

      logger.info('Cancelling ingestion job', { jobId });

      const response = await this.axiosInstance.delete(url, {
        headers: {
          'Accept': 'application/json',
          'Authorization': `Bearer ${token}`,
          'Authinstanceid': config.watsonxData.instanceId
        }
      });

      logger.info('Job cancelled successfully', { jobId });

      return response.data;
    } catch (error) {
      logger.error('Failed to cancel job', {
        jobId,
        error: error.message
      });
      throw new Error(`Failed to cancel job: ${error.message}`);
    }
  }

  /**
   * Validate ingestion configuration
   * @param {Object} config - Ingestion configuration
   * @returns {Object} Validation result
   */
  validateIngestionConfig(config) {
    const errors = [];

    // Validate target
    if (!config.target) {
      errors.push('Target configuration is required');
    } else {
      if (!config.target.catalog) errors.push('Target catalog is required');
      if (!config.target.schema) errors.push('Target schema is required');
      if (!config.target.table) errors.push('Target table is required');
    }

    // Validate source
    if (!config.source) {
      errors.push('Source configuration is required');
    } else {
      if (!config.source.file_paths) errors.push('Source file paths are required');
      if (!config.source.file_type) errors.push('Source file type is required');

      const validFileTypes = ['json', 'csv', 'parquet', 'avro', 'orc'];
      if (config.source.file_type && !validFileTypes.includes(config.source.file_type.toLowerCase())) {
        errors.push(`Invalid file type. Must be one of: ${validFileTypes.join(', ')}`);
      }
    }

    return {
      valid: errors.length === 0,
      errors
    };
  }

  /**
   * Get supported file types
   * @returns {Array<string>} List of supported file types
   */
  getSupportedFileTypes() {
    return ['json', 'csv', 'parquet', 'avro', 'orc'];
  }

  /**
   * Get default ingestion configuration
   * @returns {Object} Default configuration
   */
  getDefaultConfig() {
    return {
      engine_id: config.watsonxData.defaultEngine.engineId,
      execute_config: {
        driver_memory: config.watsonxData.defaultEngine.driverMemory,
        driver_cores: config.watsonxData.defaultEngine.driverCores,
        executor_memory: config.watsonxData.defaultEngine.executorMemory,
        executor_cores: config.watsonxData.defaultEngine.executorCores,
        num_executors: config.watsonxData.defaultEngine.numExecutors
      },
      bucket_details: {
        bucket_name: config.watsonxData.defaultBucket.bucketName,
        bucket_type: config.watsonxData.defaultBucket.bucketType
      }
    };
  }
}

module.exports = new IngestionService();

// Made with Bob
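
To show how the service above might be exercised from the Express layer, here is a short usage sketch. The route paths, import path, and response shapes are illustrative assumptions; only the service methods (validateIngestionConfig, createIngestionJob, getJobStatus) come from the code above.

// Hedged usage sketch for the ingestion service shown above.
// Route paths and import path are assumptions; the service methods are real.
const express = require('express');
const ingestionService = require('./services/ingestionService'); // path assumed

const router = express.Router();

// Create a new ingestion job after validating the submitted configuration.
router.post('/api/ingestion/jobs', async (req, res) => {
  const validation = ingestionService.validateIngestionConfig(req.body);
  if (!validation.valid) {
    return res.status(400).json({ success: false, errors: validation.errors });
  }
  try {
    const job = await ingestionService.createIngestionJob(req.body);
    res.status(201).json({ success: true, data: job });
  } catch (err) {
    res.status(502).json({ success: false, error: err.message });
  }
});

// Poll the status of a previously created job.
router.get('/api/ingestion/jobs/:jobId', async (req, res) => {
  try {
    const status = await ingestionService.getJobStatus(req.params.jobId);
    res.json({ success: true, data: status });
  } catch (err) {
    res.status(502).json({ success: false, error: err.message });
  }
});

module.exports = router;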
  • Advanced Catalog Management: Full lifecycle operations for Iceberg and Hive catalogs, including schema modification and metadata viewing; a hedged call sketch follows this list.
  • Cloud-Ready Deployment: Includes Docker Compose and Kubernetes configurations for high-availability deployments on Red Hat OpenShift or other cloud platforms.
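
As promised above, here is a brief sketch of what catalog calls against the demo backend could look like. The /api/catalogs routes and the catalog name are hypothetical illustrations of the CRUD surface, not verified paths from the repository; adjust them to the actual backend routes.

// Hedged sketch: exploring a catalog through the demo backend's REST surface.
// The /api/catalogs routes and the catalog name are hypothetical.
const axios = require('axios');

const api = axios.create({ baseURL: 'http://localhost:3001' }); // backend port from the manifests

async function exploreCatalog(catalogName) {
  // List schemas in a catalog, then list tables in the first schema found.
  const { data: schemas } = await api.get(`/api/catalogs/${catalogName}/schemas`);
  if (!schemas.data || schemas.data.length === 0) {
    return console.log('No schemas found');
  }
  const schema = schemas.data[0];
  const { data: tables } = await api.get(`/api/catalogs/${catalogName}/schemas/${schema}/tables`);
  console.log(`Tables in ${catalogName}.${schema}:`, tables.data);
}

exploreCatalog('iceberg_data').catch(console.error); // catalog name is an assumption

The deployment bullet is backed by the artifacts shipped with the project: the frontend Dockerfile comes first, followed by the backend Kubernetes manifest.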
# Frontend Dockerfile for watsonx.data Demo Application

# Build stage
FROM node:18-alpine AS builder

WORKDIR /app

# Copy package files
COPY package*.json ./

# Install dependencies
RUN npm ci

# Copy application code
COPY . .

# Build the application
RUN npm run build

# Production stage
FROM nginx:alpine

# Copy custom nginx config
COPY nginx.conf /etc/nginx/conf.d/default.conf

# Copy built application from builder stage
COPY --from=builder /app/build /usr/share/nginx/html

# Run as the non-root nginx user that already ships with the base image
# (nginx:alpine defines the nginx user/group, so re-creating it would fail the build)
RUN chown -R nginx:nginx /usr/share/nginx/html && \
    chown -R nginx:nginx /var/cache/nginx && \
    chown -R nginx:nginx /var/log/nginx && \
    chown -R nginx:nginx /etc/nginx/conf.d && \
    touch /var/run/nginx.pid && \
    chown -R nginx:nginx /var/run/nginx.pid

# Switch to non-root user
USER nginx

# Expose port
EXPOSE 3000

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/ || exit 1

# Start nginx
CMD ["nginx", "-g", "daemon off;"]

# Made with Bob
# Backend Deployment for watsonx.data Demo Application

apiVersion: apps/v1
kind: Deployment
metadata:
  name: wxdata-backend
  namespace: wxdata-demo
  labels:
    app: wxdata-demo
    component: backend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: wxdata-demo
      component: backend
  template:
    metadata:
      labels:
        app: wxdata-demo
        component: backend
    spec:
      containers:
      - name: backend
        image: wxdata-backend:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3001
          name: http
          protocol: TCP
        env:
        - name: NODE_ENV
          valueFrom:
            configMapKeyRef:
              name: wxdata-config
              key: NODE_ENV
        - name: PORT
          valueFrom:
            configMapKeyRef:
              name: wxdata-config
              key: PORT
        - name: HOST
          valueFrom:
            configMapKeyRef:
              name: wxdata-config
              key: HOST
        - name: WATSONX_BASE_URL
          valueFrom:
            configMapKeyRef:
              name: wxdata-config
              key: WATSONX_BASE_URL
        - name: WATSONX_INSTANCE_ID
          valueFrom:
            configMapKeyRef:
              name: wxdata-config
              key: WATSONX_INSTANCE_ID
        - name: WATSONX_USERNAME
          valueFrom:
            secretKeyRef:
              name: wxdata-secret
              key: WATSONX_USERNAME
        - name: WATSONX_PASSWORD
          valueFrom:
            secretKeyRef:
              name: wxdata-secret
              key: WATSONX_PASSWORD
        - name: CORS_ORIGIN
          valueFrom:
            configMapKeyRef:
              name: wxdata-config
              key: CORS_ORIGIN
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 3001
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health
            port: 3001
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        volumeMounts:
        - name: logs
          mountPath: /app/logs
      volumes:
      - name: logs
        emptyDir: {}
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        fsGroup: 1001

---
apiVersion: v1
kind: Service
metadata:
  name: wxdata-backend
  namespace: wxdata-demo
  labels:
    app: wxdata-demo
    component: backend
spec:
  type: ClusterIP
  ports:
  - port: 3001
    targetPort: 3001
    protocol: TCP
    name: http
  selector:
    app: wxdata-demo
    component: backend

# Made with Bob

Conclusion

In conclusion, Bob’s wxData-Bob application represents a paradigm shift in how organizations can interact with their data lakehouse. By wrapping the industrial-strength power of IBM watsonx.data — from its multi-engine Presto and Spark architecture to its flexible Iceberg and Hive catalogs — into a sleek, developer-centric React and Node.js interface, this project proves that complex big data can indeed “feel small”. Whether you are leveraging the provided Python scripts for rapid data generation, utilizing the UI for seamless browser-based uploads, or deploying at scale via the included Kubernetes and Docker manifests, everything you need for a production-ready proof of concept is at your fingertips. This implementation doesn’t just showcase an application; it provides a comprehensive blueprint for building intuitive, high-performance portals that unlock the true agility of the next generation of AI and data analytics.

Beyond the specific interface Bob has built, the true magic lies in the foundational power of IBM watsonx.data Developer Edition, which brings the architecture of a global enterprise lakehouse directly to a developer’s local environment. By integrating top-tier open-source engines like Presto for high-performance SQL and Spark for complex data processing, the platform empowers users to query vast, distributed datasets with the same agility as a local database. Bob’s project highlights how easily one can tap into these industrial-strength features — such as Apache Iceberg for transactional consistency and Milvus for vector-based AI workloads — to rapidly prototype and deploy sophisticated data solutions. Ultimately, this synergy between watsonx.data’s open architecture and creative development demonstrates that the future of data isn’t just about storage; it’s about the freedom to build, scale, and innovate without boundaries.


Links and Resources

GitHub repository for this post: https://github.com/aairom/wxData-Bob
watsonx.data: https://www.ibm.com/products/watsonx-data
watsonx.data Samples: https://github.com/IBM/watsonx-data
watsonx.data releases and documentation: https://www.ibm.com/docs/en/watsonxdata/standard/2.3.x
watsonx.data Developer Edition: https://www.ibm.com/docs/en/watsonxdata/standard/2.3.x?topic=developer-edition-new-version
watsonx.data on IBM Cloud: https://cloud.ibm.com/watsonxdata
watsonx.data on AWS: https://aws.amazon.com/marketplace/pp/prodview-37zxbseggo6rk
IBM Project Bob: https://www.ibm.com/products/bob
Iceberg Documentation: https://iceberg.apache.org/docs/latest/
Spark Documentation: https://spark.apache.org/docs/latest/
