VisualGPT: Multimodal AI for User Perception
Overview
VisualGPT is a multimodal AI framework that bridges the gap between computational vision models and human perception. By developing a dual-route processing framework that integrates state-of-the-art language models (GPT) with computer vision architectures (VGG, CLIP), we achieved roughly a 20% improvement in predicting human impressions of visual content over a vision-only baseline (Spearman r: 0.72 → 0.87; see Results and Performance below).
Project Motivation
Traditional computer vision models excel at object recognition but fail to capture the nuanced human impressions that drive real-world decisions. Our research addresses this gap by:
- Modeling Human Perception: Creating AI systems that predict human emotional and cognitive responses to visual content
- Multimodal Integration: Combining textual context with visual features for more accurate predictions
- Practical Applications: Enabling better user experience design, content curation, and human-AI interaction
Theoretical Framework
Dual-Route Processing Model
Our approach is inspired by dual-process theories in cognitive psychology, implementing two parallel processing pathways:
Route 1: Perceptual Processing
- Input: Raw visual features extracted using VGG-16/19 and ResNet architectures
- Processing: Direct feature mapping to impression dimensions
- Characteristics: Fast, automatic, bottom-up processing
- Output: Basic visual impression predictions (a minimal sketch of this route follows)
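Below is a minimal PyTorch sketch of how Route 1 can be realized: a frozen, ImageNet-pretrained VGG-16 backbone whose pooled features map linearly onto the impression dimensions. Class and parameter names (`PerceptualRoute`, `num_dims`) are illustrative, not the project's exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualRoute(nn.Module):
    """Route 1: map raw VGG-16 features directly to impression dimensions."""

    def __init__(self, num_dims: int = 20):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.backbone = vgg.features            # convolutional layers only
        self.backbone.requires_grad_(False)     # frozen: fast, bottom-up route
        self.pool = nn.AdaptiveAvgPool2d(1)     # collapse spatial dimensions
        self.head = nn.Linear(512, num_dims)    # VGG-16 ends with 512 channels

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.pool(self.backbone(images)).flatten(1)  # (B, 512)
        return self.head(feats)                              # (B, num_dims)
```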
Route 2: Conceptual Processing
- Input: Visual content described through natural language (via image captioning)
- Processing: GPT-based semantic understanding and contextual reasoning
- Characteristics: Slower, controlled, top-down processing
- Output: Context-aware impression predictions (sketched below)
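A matching sketch of Route 2, assuming captions have already been produced by an upstream image-captioning model. Here a Hugging Face GPT-2 encoder stands in for the GPT component; mean-pooled hidden states feed a linear impression head. Again, the names are illustrative rather than the production code.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

class ConceptualRoute(nn.Module):
    """Route 2: embed an image caption with GPT-2, then regress impressions."""

    def __init__(self, num_dims: int = 20):
        super().__init__()
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token  # GPT-2 has no pad token
        self.lm = GPT2Model.from_pretrained("gpt2")
        self.head = nn.Linear(self.lm.config.hidden_size, num_dims)

    def forward(self, captions: list[str]) -> torch.Tensor:
        batch = self.tokenizer(captions, return_tensors="pt",
                               padding=True, truncation=True)
        hidden = self.lm(**batch).last_hidden_state       # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)      # ignore padding tokens
        pooled = (hidden * mask).sum(1) / mask.sum(1)     # mean-pool over tokens
        return self.head(pooled)                          # (B, num_dims)
```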
Integration Layer
- Fusion Strategy: Learned attention mechanisms to weight perceptual vs. conceptual routes
- Adaptive Weighting: Context-dependent route importance
- Final Prediction: Ensemble output optimized for human impression alignment (see the fusion sketch below)
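One simple way to implement the adaptive weighting is a learned sigmoid gate over the two routes' outputs; a full attention mechanism follows the same pattern with more capacity. This is a sketch under those assumptions, not the project's exact fusion layer.

```python
import torch
import torch.nn as nn

class DualRouteFusion(nn.Module):
    """Blend perceptual and conceptual predictions with a learned gate."""

    def __init__(self, num_dims: int = 20):
        super().__init__()
        # The gate sees both routes' predictions and outputs, per impression
        # dimension, how much weight the perceptual route should receive.
        self.gate = nn.Sequential(
            nn.Linear(2 * num_dims, num_dims),
            nn.Sigmoid(),
        )

    def forward(self, perceptual: torch.Tensor,
                conceptual: torch.Tensor) -> torch.Tensor:
        w = self.gate(torch.cat([perceptual, conceptual], dim=-1))  # (B, D)
        return w * perceptual + (1 - w) * conceptual                # (B, D)
```

At inference, `DualRouteFusion` takes the (B, num_dims) outputs of the two route modules and returns the final impression vector.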
Technical Implementation
Data Collection and Processing
Visual Content Dataset
- Scale: 10+ years of web news images and associated text
- Sources: Major news websites, social media platforms, art databases
- Size: 2.5 million images with corresponding textual descriptions
- Annotations: Human impression ratings on 20+ perceptual dimensions
Impression Dimensions
- Competence, Warmth, Strength, Charisma, Practicality, etc.
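To make the record layout concrete, a hypothetical PyTorch `Dataset` for this data (image file, caption, and a vector of per-dimension human ratings) could look like the following; the CSV column names (`file`, `caption`, and the rating columns) are assumptions for illustration only.

```python
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class ImpressionDataset(Dataset):
    """One record: an image, its caption, and 20+ human impression ratings."""

    def __init__(self, csv_path: str, image_root: str, rating_cols: list[str]):
        self.df = pd.read_csv(csv_path)     # hypothetical annotation table
        self.image_root = image_root
        self.rating_cols = rating_cols      # e.g. ["competence", "warmth", ...]
        self.tf = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            # Standard ImageNet statistics, matching the pretrained backbones.
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])

    def __len__(self) -> int:
        return len(self.df)

    def __getitem__(self, i: int):
        row = self.df.iloc[i]
        image = self.tf(Image.open(f"{self.image_root}/{row['file']}").convert("RGB"))
        ratings = torch.tensor(row[self.rating_cols].to_numpy(dtype="float32"))
        return image, row["caption"], ratings
```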
Results and Performance
Quantitative Results
Impression Prediction Performance
- Baseline Model (VGG-16 only): R² = 0.52, Spearman r = 0.72
- GPT-only Model: R² = 0.47, Spearman r = 0.69
- Dual-Route Model: R² = 0.76, Spearman r = 0.87 (a 46% relative increase in variance explained over the baseline)
- Human-AI Agreement: Pearson r = 0.84 with averaged human ratings (human inter-rater reliability: ICC = 0.81)
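For reference, a minimal sketch of how these scores can be computed with scipy and scikit-learn (both already in the stack), assuming predictions and averaged human ratings arrive as arrays of shape (n_images, n_dimensions):

```python
import numpy as np
from scipy import stats
from sklearn.metrics import r2_score

def evaluate(preds: np.ndarray, human: np.ndarray) -> dict:
    """Score model predictions against averaged human ratings.

    Both arrays have shape (n_images, n_dimensions); scores are computed
    over the flattened values.
    """
    p, h = preds.ravel(), human.ravel()
    return {
        "r2": r2_score(h, p),                    # variance explained
        "spearman_r": stats.spearmanr(h, p)[0],  # rank correlation
        "pearson_r": stats.pearsonr(h, p)[0],    # linear correlation
    }
```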
Error Analysis
- Edge Cases: accuracy is lowest on highly abstract or symbolic content
- Improvement Areas: better handling of contextual nuances
Technical Innovations
Cross-Modal Alignment
- Contrastive Learning: Align visual and textual representations in shared space
- Mutual Information Maximization: Ensure complementary route information
- Semantic Consistency: Maintain coherence between modalities
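As an illustration of the contrastive-alignment idea, here is the standard symmetric InfoNCE (CLIP-style) objective over a batch of paired image and text embeddings; the loss actually used in the project may differ in its details:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb: torch.Tensor,
                               txt_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched image/text pairs attract, mismatches repel."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)  # diagonal = matches
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```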
Benchmark Creation
- VisualImpression Dataset: New benchmark for human impression prediction
- Evaluation Metrics: Novel metrics for multimodal impression assessment
- Baseline Comparisons: Systematic evaluation against existing methods
Applications and Impact
Real-World Applications
Digital Marketing
- Ad Effectiveness Prediction: Predict user engagement before campaign launch
- Creative Optimization: Automatically adjust visual content for target audiences
- Brand Impression Management: Monitor and optimize brand visual presence
User Experience Design
- Interface Optimization: Predict user reactions to UI/UX designs
- Content Curation: Personalized visual content recommendation
- Accessibility Enhancement: Predict usability across diverse user groups
Future Development
Technical Roadmap
Version 2.0 Enhancements
- Video Analysis: Extend to temporal visual content
Technical Stack
Deep Learning Frameworks
- PyTorch 2.0 (Primary framework)
- TensorFlow 2.x (Comparison models)
- Hugging Face Transformers
- OpenAI API (GPT integration)
Computer Vision
- OpenCV (Image processing)
- Pillow (Image manipulation)
- torchvision (Model architectures)
- CLIP (Vision-language models)
Data Processing
- pandas (Data manipulation)
- numpy (Numerical computing)
- scipy (Statistical analysis)
- scikit-learn (ML utilities)
Infrastructure
- CUDA 11.8 (GPU acceleration)
- Docker (Containerization)
- MLflow (Experiment tracking)
- Weights & Biases (Model monitoring)
Team and Collaboration
- Amber X. Chen (Principal Investigator & Lead Developer)
- Dr. Hongbo Yu (Faculty Advisor)
- Dr. Ruonan Cao & Dr. Shuo Wang (Collaborators in Neuroscience, WUSTL)
VisualGPT represents a significant advancement in AI’s ability to understand and predict human visual perception. For collaboration opportunities, technical discussions, or access to models and datasets, please contact amber.chen@psych.ucsb.edu.
