Information Spread on Social Media

Overview

This project analyzes how information propagates across social media platforms and what makes content go viral. Using more than 20 million multilingual posts from X (Twitter) and Weibo, we developed predictive models for identifying viral content. Our best-performing model (XGBoost with engineered features) achieved an F1-score of 0.82 and an AUC-ROC of 0.89 on held-out test data, with precision of 0.79 and recall of 0.85 for the viral class (defined as posts reaching more than 1,000 engagements within 24 hours).
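As a quick consistency check on the reported metrics, F1 is the harmonic mean of precision and recall, so the three figures above should agree with one another. The sketch below also encodes the viral label as defined above. The names `is_viral` and `f1_from_precision_recall` are illustrative and are not taken from the project code.

```python
VIRAL_THRESHOLD = 1000  # engagements within 24 hours, per the definition above


def is_viral(engagements_24h: int, threshold: int = VIRAL_THRESHOLD) -> bool:
    """Label a post as viral if it exceeds the threshold within 24 hours."""
    return engagements_24h > threshold


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_f1 = f1_from_precision_recall(0.79, 0.85)
print(round(reported_f1, 2))  # 0.82
```

The computed value rounds to the reported F1-score of 0.82.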

Research Questions

  • How does misinformation spread differently across cultural contexts?
  • What linguistic and behavioral patterns predict viral content?
  • How do user behavior patterns differ between Eastern and Western social media platforms?

Methodology

Data Collection

  • Scale: 20+ million multilingual social media posts
  • Platforms: X (Twitter) and Weibo
  • Languages: English, Japanese, and Chinese
  • Time Period: Multi-year longitudinal analysis
  • Behavioral Experiments: Sharing-intention experiment in the US and China (N = 400) and an eye-tracking study (N = 100)

Sampling Strategy

This project comprises multiple studies with different sampling approaches tailored to each research question:

Study 1: Emotional and Moral Expression Dynamics (Twitter)

  • Method: Stratified random sampling, with 100 tweets sampled every hour for 12 months
  • Scope: English tweets geolocated to the US; Japanese tweets geolocated to Japan
  • Purpose: Examine how emotional and moral expressions drive information spread across cultural contexts
  • Total: ~876,000 tweets per language
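The hourly scheme for Study 1 can be sketched as follows, treating each clock hour as a stratum and drawing a fixed number of tweets from it. This is a minimal illustration rather than the project's collection code: the `(timestamp, tweet_id)` record layout, the function name `sample_per_hour`, and the toy data are all assumptions. The last lines reproduce the stated total, assuming a 365-day collection year.

```python
import random
from collections import defaultdict
from datetime import datetime


def sample_per_hour(tweets, per_hour=100, seed=0):
    """Draw up to `per_hour` tweets uniformly at random from each clock hour.

    `tweets` is an iterable of (timestamp, tweet_id) pairs; this record
    layout is an assumption made for illustration.
    """
    rng = random.Random(seed)
    by_hour = defaultdict(list)
    for ts, tweet_id in tweets:
        # Stratum key: the clock hour in which the tweet was posted.
        by_hour[ts.replace(minute=0, second=0, microsecond=0)].append(tweet_id)

    sample = {}
    for hour, ids in sorted(by_hour.items()):
        k = min(per_hour, len(ids))  # an hour may hold fewer tweets than requested
        sample[hour] = rng.sample(ids, k)
    return sample


# Toy data: one tweet per minute for three hours.
tweets = [
    (datetime(2020, 1, 1, hour, minute), f"tweet-{hour}-{minute}")
    for hour in range(3)
    for minute in range(60)
]
hourly_sample = sample_per_hour(tweets, per_hour=10)
print({hour.hour: len(ids) for hour, ids in hourly_sample.items()})  # {0: 10, 1: 10, 2: 10}

# Expected yield: 100 tweets x 24 hours x 365 days.
expected_total = 100 * 24 * 365
print(expected_total)  # 876000
```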

Study 2: Controversial Topic Discourse (Weibo)

  • Method: Keyword-based purposive sampling targeting three controversial social topics
  • Scope: Chinese-language posts and associated discussion threads
  • Purpose: Analyze discourse patterns and information cascade structures around contentious issues

Study 3: Misinformation Network Structure (Twitter)

  • Method: Keyword-based sampling of verified misinformation and matched true information
  • Scope: Posts identified through fact-checking databases and matched controls
  • Purpose: Compare network propagation structures between true and false information

Technical Implementation

Natural Language Processing Pipeline

  • BERT Models: Fine-tuned multilingual BERT for semantic understanding
  • Translation APIs: Google Translate and Microsoft Translator for cross-language analysis
  • Sentiment Analysis: Custom models trained on social media data
  • Topic Modeling: Latent Dirichlet Allocation (LDA) and BERTopic

Machine Learning Models

  • Classification: Random Forest, XGBoost, and neural networks for viral prediction
  • Time Series Analysis: ARIMA models for trend prediction
  • Network Analysis: Graph neural networks for information flow modeling
  • Feature Engineering: 100+ linguistic, temporal, and social features
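To make the three feature families concrete, the sketch below derives a handful of linguistic, temporal, and social features from a single post. The specific features, the function name `extract_features`, and its inputs are hypothetical examples; the project's actual feature set of 100+ features is not listed here.

```python
import math
import re
from datetime import datetime


def extract_features(text: str, posted_at: datetime, follower_count: int) -> dict:
    """Return a small, illustrative feature dictionary for one post."""
    # Whitespace tokenization is a simplification; Chinese and Japanese text
    # would need a language-specific tokenizer.
    tokens = text.split()
    return {
        # Linguistic features
        "n_tokens": len(tokens),
        "n_hashtags": len(re.findall(r"#\w+", text)),
        "n_exclamations": text.count("!"),
        # Temporal features
        "hour_of_day": posted_at.hour,
        "is_weekend": int(posted_at.weekday() >= 5),
        # Social features (log-scaled to dampen heavy-tailed follower counts)
        "log_followers": math.log1p(follower_count),
    }


example = extract_features(
    "Big news! #breaking #now", datetime(2022, 1, 1, 14, 30), follower_count=0
)
print(example)
```

Dictionaries like this can be stacked into a feature matrix for tree-based classifiers such as Random Forest or XGBoost.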

Infrastructure

  • Computing: High-Performance Computing clusters
  • Storage: Distributed database systems for large-scale data
  • Processing: PySpark for parallel processing
  • APIs: Custom APIs for real-time data collection

Key Findings

Drivers of Viral Content

We identified two key drivers of viral content spread:

  • Heterogeneous Emotion: Posts expressing mixed or diverse emotional content (e.g., combining anger with hope) spread more widely than those with uniform emotional tone
  • Homogeneous Morality: Posts with consistent moral framing (e.g., unified appeals to fairness or loyalty) achieve greater reach than those with conflicting moral messages
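One way to quantify how mixed or uniform a post's emotional or moral content is would be the Shannon entropy of its category labels: higher entropy means a more heterogeneous mix, and zero means a single category. This is offered only as an illustrative operationalization. The text above does not specify the measure used in the studies, and the label names below are invented for the example.

```python
import math
from collections import Counter


def shannon_entropy(labels) -> float:
    """Entropy (in bits) of the empirical distribution over category labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # sum of p * log2(1 / p), written with counts to avoid a negative zero
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical emotion labels detected in two posts.
mixed_emotion = shannon_entropy(["anger", "hope"])
uniform_emotion = shannon_entropy(["anger", "anger"])
print(mixed_emotion, uniform_emotion)  # 1.0 0.0

# Hypothetical moral-foundation labels: a single consistent frame scores 0.
consistent_morality = shannon_entropy(["fairness", "fairness", "fairness"])
print(consistent_morality)  # 0.0
```

Under this reading, heterogeneous emotion corresponds to high entropy over emotion labels and homogeneous morality to low entropy over moral-frame labels.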

Emotion Contagion

  • Western Platforms: Emotional content spreads faster, especially positive emotions
  • Eastern Platforms: Mixed emotional content spreads more easily
  • Language Effects: Certain linguistic structures promote faster spread

Misinformation Spread Patterns

  • Speed: Consistent with prior research (Vosoughi et al., 2018), false information spread significantly faster than verified information in our dataset
  • Network Structure: Misinformation creates denser, more connected networks (measured by clustering coefficient and average path length)
  • User Behavior: False information paired with social approval cues (e.g., high like/retweet counts) showed 2.3x higher sharing intention in our behavioral experiments
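The two network measures named above can be illustrated on a toy undirected graph. The sketch below is a plain-Python version for clarity, not the project's analysis code; NetworkX provides equivalent functions (`average_clustering` and `average_shortest_path_length`). The helper names and the toy edge list are assumptions.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 contribute 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count neighbor pairs that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over reachable node pairs (BFS from each node)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Toy sharing network: a triangle (a, b, c) with one pendant node d.
toy = build_graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(round(average_clustering(toy), 3))   # 0.583
print(round(average_path_length(toy), 3))  # 1.333
```

A higher clustering coefficient indicates more tightly knit sharing neighborhoods, and a shorter average path length indicates that nodes are, on average, fewer hops apart.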

Impact and Applications

Research Contributions

  • Oral presentation at SAS 2022

Practical Applications

  • Social media monitoring tools for public health organizations
  • Content recommendation improvements for social media platforms

Policy Implications

  • Provided evidence for regulatory discussions on information quality
  • Supported digital literacy initiatives

Technologies Used

Programming Languages

  • Python (Primary)
  • SQL (Data Management)
  • JavaScript (Visualization)

Machine Learning

  • BERT, RoBERTa, XLM-R
  • PyTorch, TensorFlow
  • Scikit-learn
  • Hugging Face Transformers

Data Processing

  • PySpark
  • Pandas, NumPy
  • MongoDB, PostgreSQL

Visualization

  • D3.js
  • Plotly
  • NetworkX
  • Tableau

Data Availability

  • Code: Available on OSF Repository
  • Preprocessed Data: Available upon request (due to privacy considerations)
  • Documentation: Comprehensive documentation included

Publications and Presentations

  1. Chen, Y., Chen, A. X., Yu, H., & Sun, S. (2023). “Unraveling moral and emotional discourses on social media: a study of three cases.” Information, Communication & Society. DOI: 10.1080/1369118X.2023.2246551

Future Directions

Technical Improvements

  • Integration of multimodal analysis (text + images + videos)
  • Real-time adaptation using online learning algorithms
  • Causal inference methods for understanding information spread mechanisms

Research Extensions

  • Long-term longitudinal studies spanning multiple years
  • Integration with offline behavior data
  • Cross-platform analysis including emerging social media platforms

Collaboration Opportunities

  • Partnership with social media platforms for data access
  • Collaboration with public health organizations
  • Integration with fact-checking organizations

Team and Acknowledgments

Core Team

  • Amber X. Chen (Lead Data Scientist)
  • Dr. Hongbo Yu (Faculty Advisor, UCSB)
  • Dr. Helene Fung (Faculty Advisor, CUHK)
  • Dr. Yibei Chen (Research Collaborator, MIT)
  • Dr. Shaojing Sun (Research Collaborator, Fudan University)

Funding

  • Research Grants Council of Hong Kong General Research Fund
  • Interdisciplinary Research Seed Funding, CUHK

Computing Resources

  • UCSB Center for Scientific Computing
  • CUHK High Performance Computing Centre
  • Google Cloud Platform

For more information about this project or to request access to data and code, please contact amber.chen@psych.ucsb.edu.