Dissecting Scientific Papers with NLP
GitHub repo here.
"Language is the road map of a culture. It tells you where its people come from and where they are going."—Rita Mae Brown
Since starting my PhD in 2019, I’ve used text analysis to help me with various aspects of my research, including gearing up for my comprehensive exam. During this period, I delved into over a hundred research papers centered around two pivotal themes: "Novelty Reception," which explores how new ideas gain acceptance, and "Network and Gender," which examines how social networks impact the careers of men and women differently. These two streams of research are foundational to my work on the systemic barriers women face in creative fields.
I believe Rita Mae Brown's insight applies to scientific research as well: the language employed in scientific papers not only reflects current understanding but also signals the directions a field might take. Here, I decode this language using NLP.
Phases of analysis
In this GitHub repository, you'll find a series of notebooks that employ different NLP techniques to dissect and understand the thematic structures embedded within these research papers. You can also find them here:
TF-IDF and weighted log odds: This notebook extracts metadata from academic citations and dives deep into the abstracts to pinpoint key terms using TF-IDF and weighted log odds. This analysis highlights distinctive terms within the "Novelty Reception" and "Network and Gender" research streams, giving us a sense of the specific language and themes that are most pivotal in these streams.
Topic modeling: This notebook uses Latent Dirichlet Allocation (LDA) for topic modeling of the 100+ paper abstracts. Specifically, it:
Transforms text data, preparing it for deeper analysis.
Constructs a document-term matrix to serve as the foundation for topic models.
Trains a range of models (up to 130 topics) to determine the best fit based on thematic coherence.
Delivers insights through detailed analysis and visualizations of topic prevalence and coherence.
Investigates how topics relate and cluster using a dendrogram, offering a visual representation of topic relationships based on their co-occurrence within documents.
The dataset analyzed is available via a Google Sheets link provided in the first notebook, so anyone can replicate the study by simply running the provided code. To save you time and computational resources, I've also included the trained models from the second notebook.