VisualGPT: Multimodal AI for User Perception

Overview

VisualGPT represents a breakthrough in multimodal AI that bridges the gap between computational vision models and human perception. By developing a dual-route processing framework that integrates state-of-the-art language models (GPT) with computer vision architectures (VGG, CLIP), we achieved a 20% improvement in predicting human impressions of visual content compared to baseline models.

Project Motivation

Traditional computer vision models excel at object recognition but fail to capture the nuanced human impressions that drive real-world decisions. Our research addresses this gap by:

  • Modeling Human Perception: Creating AI systems that predict human emotional and cognitive responses to visual content
  • Multimodal Integration: Combining textual context with visual features for more accurate predictions
  • Practical Applications: Enabling better user experience design, content curation, and human-AI interaction

Theoretical Framework

Dual-Route Processing Model

Our approach is inspired by dual-process theories in cognitive psychology, implementing two parallel processing pathways:

Route 1: Perceptual Processing

  • Input: Raw visual features extracted using VGG-16/19 and ResNet architectures
  • Processing: Direct feature mapping to impression dimensions
  • Characteristics: Fast, automatic, bottom-up processing
  • Output: Basic visual impression predictions
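The perceptual route can be sketched as a direct linear read-out from pooled backbone features. The feature dimension, head weights, and sigmoid output range below are illustrative stand-ins, not the project's trained parameters:

```python
import numpy as np

def perceptual_route(features: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Map pooled CNN features directly to impression-dimension scores.

    features: (batch, d) pooled backbone activations (e.g. a VGG fc-layer output)
    W:        (d, k) learned projection to k impression dimensions
    b:        (k,)   per-dimension bias
    Returns scores squashed into (0, 1) via a sigmoid, one per dimension.
    """
    logits = features @ W + b
    return 1.0 / (1.0 + np.exp(-logits))

# Toy usage: 2 images, 4096-d features, 5 impression dimensions
rng = np.random.default_rng(0)
feats = rng.standard_normal((2, 4096))
W = rng.standard_normal((4096, 5)) * 0.01
b = np.zeros(5)
scores = perceptual_route(feats, W, b)  # shape (2, 5)
```

In the full model this head would be trained against human impression ratings; the sketch only shows the "fast, bottom-up" shape of the computation: no language, no context, features in and scores out.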

Route 2: Conceptual Processing

  • Input: Visual content described through natural language (via image captioning)
  • Processing: GPT-based semantic understanding and contextual reasoning
  • Characteristics: Slower, controlled, top-down processing
  • Output: Context-aware impression predictions
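One way the conceptual route can be realized is by prompting a language model with the image caption. The template and dimension names below are hypothetical illustrations, not the project's actual prompts:

```python
def build_impression_prompt(caption: str, dimensions: list[str]) -> str:
    """Construct an impression-rating prompt from an image caption.

    Hypothetical template: the real prompts used by VisualGPT are not
    reproduced here.
    """
    dims = ", ".join(dimensions)
    return (
        f'An image is described as: "{caption}".\n'
        f"Rate the impression it would make on a typical viewer along these "
        f"dimensions ({dims}), each on a 1-7 scale, one 'dimension: score' per line."
    )

prompt = build_impression_prompt(
    "a politician shaking hands with factory workers",
    ["Competence", "Warmth", "Strength"],
)
```

The caption step makes the route "slower and top-down" by design: predictions depend on what the image means, not just how it looks.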

Integration Layer

  • Fusion Strategy: Learned attention mechanisms to weight perceptual vs. conceptual routes
  • Adaptive Weighting: Context-dependent route importance
  • Final Prediction: Ensemble output optimized for human impression alignment
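The integration layer can be sketched as a learned gate over the two routes. The softmax gate below stands in for the attention mechanism; in the full model the gate logits would come from a small network over both routes' features, and all shapes here are illustrative:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    z = x - x.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_routes(perceptual: np.ndarray,
                conceptual: np.ndarray,
                gate_logits: np.ndarray) -> np.ndarray:
    """Adaptively weight the two routes' predictions per example.

    perceptual, conceptual: (batch, k) impression predictions from each route
    gate_logits:            (batch, 2) context-dependent route scores
    """
    w = softmax(gate_logits, axis=-1)                 # rows sum to 1
    return w[:, :1] * perceptual + w[:, 1:] * conceptual

p = np.array([[0.2, 0.8]])
c = np.array([[0.6, 0.4]])
out = fuse_routes(p, c, np.array([[0.0, 0.0]]))  # equal logits -> equal weights
```

With equal gate logits the fusion reduces to the mean of the two routes; learned, context-dependent logits let the model lean on the perceptual route for concrete scenes and the conceptual route where meaning dominates.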

Technical Implementation

Data Collection and Processing

Visual Content Dataset

  • Scale: 10+ years of web news images and associated text
  • Sources: Major news websites, social media platforms, art databases
  • Size: 2.5 million images with corresponding textual descriptions
  • Annotations: Human impression ratings on 20+ perceptual dimensions

Impression Dimensions

  • Competence, Warmth, Strength, Charisma, Practicality, and other rated dimensions (20+ in total)

Results and Performance

Quantitative Results

Impression Prediction Performance

  • Baseline Model (VGG-16 only): R² = 0.52, Spearman r = 0.72
  • GPT-only Model: R² = 0.47, Spearman r = 0.69
  • Dual-Route Model: R² = 0.76, Spearman r = 0.87 (+46% variance explained vs. baseline)
  • Human-AI Agreement: Pearson r = 0.84 with averaged human ratings (human inter-rater reliability: ICC = 0.81)
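The headline metrics above are standard and easy to compute for any model's predictions; a short sketch on toy data (not the project's ratings), using scipy for the rank correlation:

```python
import numpy as np
from scipy.stats import spearmanr

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy ratings and predictions on one impression dimension
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

r2 = r_squared(y_true, y_pred)
rho, _ = spearmanr(y_true, y_pred)  # ranks agree perfectly here
```

Note that the "+46% variance explained" figure is relative (0.76 vs. 0.52 in R²), while Spearman r measures rank agreement only, which is why both are reported.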

Error Analysis

  • Edge Cases: Highly abstract or symbolic content
  • Improvement Areas: Better handling of contextual nuances

Technical Innovations

Cross-Modal Alignment

  • Contrastive Learning: Align visual and textual representations in shared space
  • Mutual Information Maximization: Ensure complementary route information
  • Semantic Consistency: Maintain coherence between modalities
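The contrastive-alignment objective can be sketched as a CLIP-style symmetric InfoNCE loss over a batch of paired image/text embeddings; the temperature, batch size, and embedding dimension below are illustrative:

```python
import numpy as np

def info_nce(img_emb: np.ndarray, txt_emb: np.ndarray,
             temperature: float = 0.07) -> float:
    """Symmetric contrastive loss: matched pairs are positives,
    all other pairings in the batch are negatives.

    img_emb, txt_emb: (batch, d) L2-normalized embeddings, row i paired with row i.
    """
    logits = img_emb @ txt_emb.T / temperature  # (batch, batch) similarities
    labels = np.arange(len(logits))

    def xent(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)                      # stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))   # log-softmax
        return float(-logp[labels, labels].mean())

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
v = rng.standard_normal((4, 8))
v /= np.linalg.norm(v, axis=1, keepdims=True)
t = v.copy()                 # perfectly aligned pairs
loss_aligned = info_nce(v, t)
loss_shuffled = info_nce(v, t[::-1].copy())  # mismatched pairs score worse
```

Minimizing this loss pulls each image embedding toward its own caption and away from the rest of the batch, which is what places the two routes' representations in a shared space.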

Benchmark Creation

  • VisualImpression Dataset: New benchmark for human impression prediction
  • Evaluation Metrics: Novel metrics for multimodal impression assessment
  • Baseline Comparisons: Systematic evaluation against existing methods

Applications and Impact

Real-World Applications

Digital Marketing

  • Ad Effectiveness Prediction: Predict user engagement before campaign launch
  • Creative Optimization: Automatically adjust visual content for target audiences
  • Brand Impression Management: Monitor and optimize brand visual presence

User Experience Design

  • Interface Optimization: Predict user reactions to UI/UX designs
  • Content Curation: Personalized visual content recommendation
  • Accessibility Enhancement: Predict usability across diverse user groups

Future Development

Technical Roadmap

Version 2.0 Enhancements

  • Video Analysis: Extend to temporal visual content

Technical Stack

Deep Learning Frameworks

  • PyTorch 2.0 (Primary framework)
  • TensorFlow 2.x (Comparison models)
  • Hugging Face Transformers
  • OpenAI API (GPT integration)

Computer Vision

  • OpenCV (Image processing)
  • Pillow (Image manipulation)
  • torchvision (Model architectures)
  • CLIP (Vision-language models)

Data Processing

  • pandas (Data manipulation)
  • numpy (Numerical computing)
  • scipy (Statistical analysis)
  • scikit-learn (ML utilities)

Infrastructure

  • CUDA 11.8 (GPU acceleration)
  • Docker (Containerization)
  • MLflow (Experiment tracking)
  • Weights & Biases (Model monitoring)

Team and Collaboration

  • Amber X. Chen (Principal Investigator & Lead Developer)
  • Dr. Hongbo Yu (Faculty Advisor)
  • Dr. Ruonan Cao & Dr. Shuo Wang (Collaborators in Neuroscience, WUSTL)

VisualGPT represents a significant advancement in AI’s ability to understand and predict human visual perception. For collaboration opportunities, technical discussions, or access to models and datasets, please contact amber.chen@psych.ucsb.edu.