Excited to share a new dataset preprint!
WikiReddit: Tracing Information and Attention Flows Between Online Platforms. A collaboration with Anna Beers, Viviane Ito, Agustin Orozco, and Francesca Tripodi.
We introduce a new dataset of 12.2M Wikipedia links shared on Reddit over four years, offering a new lens on how two of the web's most influential platforms interact.
- Preprint: https://doi.org/10.48550/arXiv.2502.04942
- Dataset: https://doi.org/10.5281/zenodo.14653265
Our dataset enables further research on:
- How Reddit drives attention to Wikipedia
- Wikipedia's (mis)use as a trusted source in online discourse
- How different communities engage with Wikipedia, revealing knowledge gaps
- The interplay of demographic and platform biases on Reddit and Wikipedia
The SQL database contains:
- All Reddit posts and comments mentioning Wikipedia from 2020 to 2023, including hyperlinks, hashed IDs, and metadata.
- Edit history and page view activity for Wikipedia articles at the time of posting, along with page IDs and redirects.

Care has been taken to omit and anonymise any potentially personally identifiable information. Future researchers with access to the Reddit and Wikipedia APIs can enrich their analyses by pulling extra data (e.g. post content) and linking to the processed cross-platform information provided here.
On exploratory analysis, we find:
- Declining activity in posts, but stable performance of Wikipedia content on Reddit
- Strong correlations between Wikipedia and Reddit activity
- Intriguing asymmetric patterns of cross-lingual linking, dominated by English



Thank you to Wikimedia Research and the reddit4researchers programme for the data access. I spoke more about this and other projects at the January Wikimedia Research Showcase, which you can watch here: https://www.youtube.com/live/gvF8p4r91NE?t=2177s