Information Spread on Social Media

Published: January 01, 2023

Overview

This project analyzes how information propagates across social media platforms and what makes content go viral. Using more than 20 million multilingual posts from X (Twitter) and Weibo, we developed predictive models for identifying viral content, defined as posts reaching more than 1,000 engagements within 24 hours. Our best-performing model (XGBoost with engineered features) achieved an F1-score of 0.82 and an AUC-ROC of 0.89 on held-out test data, with precision of 0.79 and recall of 0.85 for the viral content class.
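As a minimal sketch of the viral-content definition and of how the reported metrics relate to each other, the snippet below labels a post from its engagement timestamps and recomputes F1 from the stated precision and recall. The function names and the input layout are hypothetical, and the sketch assumes the 24-hour window starts at posting time, which the text above does not specify.

```python
from datetime import datetime, timedelta

VIRAL_THRESHOLD = 1000          # engagements, from the definition above
WINDOW = timedelta(hours=24)    # assumed to start at posting time


def is_viral(posted_at, engagement_times):
    """Return True if more than VIRAL_THRESHOLD engagements fall inside the window."""
    cutoff = posted_at + WINDOW
    in_window = sum(1 for t in engagement_times if posted_at <= t <= cutoff)
    return in_window > VIRAL_THRESHOLD


def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The reported precision (0.79) and recall (0.85) imply the reported F1.
f1 = f1_from_precision_recall(0.79, 0.85)
print(round(f1, 2))  # 0.82
```

The arithmetic confirms that the three reported numbers are mutually consistent: 2 × 0.79 × 0.85 / (0.79 + 0.85) ≈ 0.819.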

Research Questions

How does misinformation spread differently across cultural contexts?
What linguistic and behavioral patterns predict viral content?
How do user behavior patterns differ between Eastern and Western social media platforms?

Methodology

Data Collection

Scale: 20+ million multilingual social media posts
Platforms: X (Twitter) and Weibo
Languages: English, Japanese, and Chinese
Time Period: Multi-year longitudinal analysis
Behavioral Experiments: Sharing-intention experiment in the US and China (N = 400) and an eye-tracking study (N = 100)

Sampling Strategy

This project comprises multiple studies with different sampling approaches tailored to each research question:

Study 1: Emotional and Moral Expression Dynamics (Twitter)

Method: Random sampling stratified by hour, with 100 tweets sampled every hour for 12 months
Scope: English tweets geolocated to the US; Japanese tweets geolocated to Japan
Purpose: Examine how emotional and moral expressions drive information spread across cultural contexts
Total: ~876,000 tweets per language (100 tweets × 24 hours × 365 days)
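The hourly sampling scheme for Study 1 can be sketched as follows. This is an illustration under stated assumptions, not the project's collection code: the `(tweet_id, created_at)` record layout, the function name `sample_per_hour`, and the fixed seed are all hypothetical.

```python
import random
from collections import defaultdict
from datetime import datetime

TWEETS_PER_HOUR = 100  # from the sampling description above


def sample_per_hour(tweets, per_hour=TWEETS_PER_HOUR, seed=0):
    """Draw up to `per_hour` tweets uniformly at random from each clock hour.

    `tweets` is an iterable of (tweet_id, created_at) pairs; this record
    layout is a placeholder, not the project's actual schema.
    """
    rng = random.Random(seed)
    by_hour = defaultdict(list)
    for tweet_id, created_at in tweets:
        # Truncate the timestamp to the start of its clock hour (the stratum).
        hour = created_at.replace(minute=0, second=0, microsecond=0)
        by_hour[hour].append(tweet_id)

    sample = {}
    for hour, ids in sorted(by_hour.items()):
        k = min(per_hour, len(ids))  # hours with fewer tweets are kept whole
        sample[hour] = rng.sample(ids, k)
    return sample


# Expected yield over a non-leap year: 100 tweets x 24 hours x 365 days.
expected_total = TWEETS_PER_HOUR * 24 * 365
print(expected_total)  # 876000
```

Treating each clock hour as a stratum keeps high-volume hours from dominating the sample, which matters when posting activity varies strongly over the day.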

Study 2: Controversial Topic Discourse (Weibo)

Method: Keyword-based purposive sampling targeting three controversial social topics
Scope: Chinese-language posts and associated discussion threads
Purpose: Analyze discourse patterns and information cascade structures around contentious issues

Study 3: Misinformation Network Structure (Twitter)

Method: Keyword-based sampling of verified misinformation and matched true information
Scope: Posts identified through fact-checking databases and matched controls
Purpose: Compare network propagation structures between true and false information

Technical Implementation

Natural Language Processing Pipeline

BERT Models: Fine-tuned multilingual BERT for semantic understanding
Translation APIs: Google Translate and Microsoft Translator for cross-language analysis
Sentiment Analysis: Custom models trained on social media data
Topic Modeling: Latent Dirichlet Allocation (LDA) and BERTopic

Machine Learning Models

Classification: Random Forest, XGBoost, and neural networks for viral prediction
Time Series Analysis: ARIMA models for trend prediction
Network Analysis: Graph neural networks for information flow modeling
Feature Engineering: 100+ linguistic, temporal, and social features
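To illustrate what the three feature families might look like in practice, here is a small sketch that derives a few linguistic, temporal, and social features from a single post. Every feature name below is a hypothetical example chosen for illustration; the project's actual feature set of 100+ features is not listed in this write-up.

```python
import math
from datetime import datetime


def extract_features(text, posted_at, follower_count):
    """Build a small feature dict for one post (all feature names are illustrative)."""
    tokens = text.split()
    return {
        # Linguistic features
        "n_tokens": len(tokens),
        "n_hashtags": sum(1 for tok in tokens if tok.startswith("#")),
        "n_exclamations": text.count("!"),
        # Temporal features
        "hour_of_day": posted_at.hour,
        "is_weekend": int(posted_at.weekday() >= 5),
        # Social features (log scale tames heavy-tailed follower counts)
        "log_followers": math.log1p(follower_count),
    }


features = extract_features(
    "Breaking news! #health #policy",
    datetime(2022, 3, 5, 14, 30),   # a Saturday
    follower_count=1200,
)
print(features)
```

A dictionary per post like this can be stacked into a feature matrix and passed to any of the classifiers named above.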

Infrastructure

Computing: High-Performance Computing clusters
Storage: Distributed database systems for large-scale data
Processing: PySpark for parallel processing
APIs: Custom APIs for real-time data collection

Key Findings

Drivers of Viral Content

We identified two key drivers of viral content spread:

Heterogeneous Emotion: Posts expressing mixed or diverse emotional content (e.g., combining anger with hope) spread more widely than those with uniform emotional tone
Homogeneous Morality: Posts with consistent moral framing (e.g., unified appeals to fairness or loyalty) achieve greater reach than those with conflicting moral messages
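One way to make "heterogeneous" versus "homogeneous" concrete is normalized Shannon entropy over a post's category scores. This is an assumed operationalization for illustration only; the write-up does not state which measure the studies used, and the emotion labels and scores below are made up.

```python
import math


def emotion_heterogeneity(emotion_scores):
    """Normalized Shannon entropy of non-negative emotion scores, in [0, 1].

    0 means all weight sits on one emotion; 1 means weight is spread evenly.
    """
    total = sum(emotion_scores.values())
    if total <= 0 or len(emotion_scores) < 2:
        return 0.0
    probs = [v / total for v in emotion_scores.values() if v > 0]
    entropy = sum(p * math.log(1 / p) for p in probs)
    # Divide by the maximum possible entropy for this number of categories.
    return entropy / math.log(len(emotion_scores))


uniform_tone = {"anger": 1.0, "hope": 0.0, "fear": 0.0}
mixed_tone = {"anger": 0.5, "hope": 0.5, "fear": 0.0}

print(emotion_heterogeneity(uniform_tone))  # 0.0
print(emotion_heterogeneity(mixed_tone))    # about 0.63
```

The same function applied to moral-foundation scores would yield a low value for a post with a single consistent moral frame, matching the "homogeneous morality" pattern described above.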

Emotion Contagion

Western Platforms: Emotional content, especially content expressing positive emotions, spreads faster
Eastern Platforms: Mixed emotional content spreads more easily
Language Effects: Certain linguistic structures promote faster spread

Misinformation Spread Patterns

Speed: Consistent with prior research (Vosoughi et al., 2018), false information spread significantly faster than verified information in our dataset
Network Structure: Misinformation cascades form denser, more connected networks (measured by clustering coefficient and average path length)
User Behavior: False information paired with social approval cues (e.g., high like/retweet counts) showed 2.3x higher sharing intention in our behavioral experiments
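The two network measures named above can be sketched in plain Python on toy graphs. The edges below are invented for illustration and are not project data, and the sketch treats sharing links as undirected, which is an assumption; NetworkX, listed under Technologies Used, provides comparable functions (`average_clustering` and `average_shortest_path_length`).

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Undirected adjacency sets from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        # Count neighbor pairs that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all ordered reachable node pairs (BFS)."""
    dist_sum, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        dist_sum += sum(dist.values())
        pairs += len(dist) - 1
    return dist_sum / pairs if pairs else 0.0


# Toy sharing graphs (made-up edges, not project data).
triangle_with_tail = build_graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
chain = build_graph([("a", "b"), ("b", "c"), ("c", "d")])

for name, g in [("triangle_with_tail", triangle_with_tail), ("chain", chain)]:
    print(name, average_clustering(g), average_path_length(g))
```

On these toy graphs, the one containing a triangle has a higher clustering coefficient and a shorter average path length than the chain, which is the sense in which a network is "denser, more connected".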

Impact and Applications

Research Contributions

Oral presentation at SAS 2022

Practical Applications

Social media monitoring tools for public health organizations
Content recommendation improvements for social media platforms

Policy Implications

Provided evidence for regulatory discussions on information quality
Supported digital literacy initiatives

Technologies Used

Programming Languages

Python (Primary)
SQL (Data Management)
JavaScript (Visualization)

Machine Learning

BERT, RoBERTa, XLM-R
PyTorch, TensorFlow
Scikit-learn
Hugging Face Transformers

Data Processing

PySpark
Pandas, NumPy
MongoDB, PostgreSQL

Visualization

D3.js
Plotly
NetworkX
Tableau

Data Availability

Code: Available on OSF Repository
Preprocessed Data: Available upon request (due to privacy considerations)
Documentation: Comprehensive documentation included

Publications and Presentations

Chen, Y., Chen, A. X., Yu, H., & Sun, S. (2023). “Unraveling moral and emotional discourses on social media: a study of three cases.” Information, Communication & Society. DOI: 10.1080/1369118X.2023.2246551

Future Directions

Technical Improvements

Integration of multimodal analysis (text + images + videos)
Real-time adaptation using online learning algorithms
Causal inference methods for understanding information spread mechanisms

Research Extensions

Long-term longitudinal studies spanning multiple years
Integration with offline behavior data
Cross-platform analysis including emerging social media platforms

Collaboration Opportunities

Partnership with social media platforms for data access
Collaboration with public health organizations
Integration with fact-checking organizations

Team and Acknowledgments

Core Team

Amber X. Chen (Lead Data Scientist)
Dr. Hongbo Yu (Faculty Advisor, UCSB)
Dr. Helene Fung (Faculty Advisor, CUHK)
Dr. Yibei Chen (Research Collaborator, MIT)
Dr. Shaojing Sun (Research Collaborator, Fudan University)

Funding

Research Grants Council of Hong Kong General Research Fund
Interdisciplinary Research Seed Funding, CUHK

Computing Resources

UCSB Center for Scientific Computing
CUHK High Performance Computing Centre
Google Cloud Platform

For more information about this project or to request access to data and code, please contact amber.chen@psych.ucsb.edu.


Amber X. Chen