We present PAPYER, a dataset collected for narrative discovery in social media discussions, together with a human-in-the-loop pipeline that combines human and machine kernels. Our method builds on SNaCK and, using fewer than 30k triplets, discovers a low-dimensional space of narratives revolving around hygiene in public restrooms, comparing favorably to prior topic modelling and transformer-based methods. Our long-term vision is analogous to Visipedia: we wish to capture and share human narratives in online discussions across a wide array of topics. The present work represents our first foray in this direction, with a deep dive into a single topic.
Social media platforms give rise to an abundance of posts and comments on every topic imaginable. Many of these posts express opinions on various aspects of society, but their unfalsifiable nature makes them ill-suited to fact-checking pipelines. Understanding and visualizing these narratives can facilitate more informed debates on social media. As a first step towards systematically identifying the underlying narratives on social media, we introduce PAPYER, a fine-grained dataset of online comments related to hygiene in public restrooms, which contains a multitude of unfalsifiable claims. We present a human-in-the-loop pipeline that uses a combination of machine and human kernels to discover the prevailing narratives, and show that this pipeline outperforms recent large transformer models and state-of-the-art unsupervised topic models.
Our human-in-the-loop pipeline works in three steps. First, we embed the sentences with a transformer (e.g., T5). Second, we sample two random sentences: one becomes the anchor of a HIT, and the other becomes one of the five text snippets that workers select from; the remaining four snippets are the sentences whose transformer embeddings are closest, in L2 distance, to the anchor's. Third, we gather high-quality triplets from Amazon Mechanical Turk and use them as human annotations to train the SNaCK embedding; see our paper for details.
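The HIT sampling in step two can be sketched as follows. This is a minimal illustration, not our released code: the function name `sample_hit` and the use of precomputed NumPy embeddings are assumptions for the example.

```python
import numpy as np

def sample_hit(embeddings, rng):
    """Sample one HIT: an anchor, one random snippet, and the 4
    sentences nearest the anchor in transformer-embedding space.

    embeddings: (n, d) array of sentence embeddings (assumed precomputed).
    Returns (anchor_index, list of 5 candidate indices).
    """
    n = len(embeddings)
    # Two random sentences: the anchor and one of the 5 candidates.
    anchor, rand = rng.choice(n, size=2, replace=False)
    # L2 distance from the anchor to every sentence embedding.
    dists = np.linalg.norm(embeddings - embeddings[anchor], axis=1)
    dists[anchor] = np.inf  # exclude the anchor itself
    dists[rand] = np.inf    # the random snippet is already chosen
    # The final 4 candidates are the anchor's nearest neighbours.
    nearest = np.argsort(dists)[:4]
    candidates = [int(rand)] + [int(i) for i in nearest]
    rng.shuffle(candidates)  # present the 5 options in random order
    return int(anchor), candidates
```

A worker shown the anchor then selects the candidate most similar to it, which yields triplets of the form (anchor, selected, not-selected).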
Visualisation of human annotations. Each line represents a positive (left) or negative (right) human annotation for an anchor. The histograms along the circumference show the number of incoming connections. The color of each histogram indicates the majority class among the incoming connections, and the color of each line indicates the ground-truth class of the anchor. If the histogram's color matches the class color above it, the pair belongs to the ground-truth class; if the colors differ, it does not. The numbers above the class colors indicate the individual claim classes.
We compare our method (SNaCK) to LDA, t-SNE (with both BERT and T5 embeddings), UMAP (with both BERT and T5 embeddings), and LDA-transformer (topics contextualized with information from a transformer and visualized with t-SNE). Experiments are conducted on the PAPYER dataset, where SNaCK achieves a superior triplet generalization ratio, albeit at the cost of a noisier embedding than its initial one.
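The triplet generalization ratio can be computed as the fraction of held-out human triplets that an embedding satisfies, i.e. where the anchor lies closer to the similar item than to the dissimilar one. A minimal sketch, assuming triplets are index tuples `(anchor, similar, dissimilar)` into a 2-D embedding array (the function name is ours, not from the paper's code):

```python
import numpy as np

def triplet_generalization_ratio(embedding, triplets):
    """Fraction of triplets (a, p, n) for which the embedded anchor a
    is closer to the similar item p than to the dissimilar item n."""
    satisfied = 0
    for a, p, n in triplets:
        d_pos = np.linalg.norm(embedding[a] - embedding[p])
        d_neg = np.linalg.norm(embedding[a] - embedding[n])
        satisfied += d_pos < d_neg
    return satisfied / len(triplets)
```

A ratio of 1.0 means every held-out triplet agrees with the low-dimensional layout; 0.5 is chance level.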
@article{christensen2022PAPYER,
author = {Christensen, Peter E. and Warburg, Frederik and Jia, Menglin and Belongie, Serge},
title = {Searching for Structure in Unfalsifiable Claims},
journal = {-},
year = {2022},
}