Topical Clustering of Unlabeled Transformer-Encoded Researcher Activity
Keywords
topical clustering, document similarity, document encoding, BERT, natural language processing, clustering, k-means, DBSCAN, keyword extraction
Abstract
Transformer models capture the meaning of text efficiently through self-attention mechanisms. We investigate the meanings bundled in clusters of transformer-generated embeddings by evaluating the topical clustering accuracy on the unlabeled scientific papers of the DIT publications database. After experimenting with SciBERT and German-BERT, we focus on mBERT because the papers are multilingual. By encoding and clustering the research publications, we create a landscape representation of the scientific fields under active research. In the absence of topic labels in the data (no ground truth), standard clustering metrics cannot evaluate the accuracy of the topical clustering. We therefore exploit the coauthorship information in the papers and perform a coauthorship analysis in two parts: an investigation of the uniqueness of authors in each cluster and the construction of coauthorship-based social networks. The high uniqueness of authors in the formed clusters and the homogeneity of topics across the connected components of the social networks imply an accurate topical clustering of our encodings. Moreover, the constructed social networks reveal a set of connecting internal authors whose mutual collaborations form a large network holding 74% of all papers in the database.
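To make the encoding-and-clustering pipeline concrete, the following is a minimal sketch, assuming the Hugging Face transformers and scikit-learn libraries; the mean-pooling strategy, the sample texts, and the cluster count are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: encode multilingual texts with mBERT, then cluster with k-means.
# Assumptions (not from the paper): mean pooling over token embeddings and an
# illustrative cluster count.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

MODEL = "bert-base-multilingual-cased"  # mBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

def encode(texts):
    """Return one pooled mBERT embedding per input text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state         # (n, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # zero out padding
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy() # mean pooling

abstracts = [
    "Deep learning methods for semantic image segmentation.",
    "Maschinelles Lernen in der medizinischen Bildverarbeitung.",
    "Graph-based analysis of road traffic networks.",
]
embeddings = encode(abstracts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # cluster assignment per paper
```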
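The coauthorship-based social networks can be sketched in a similar spirit; the snippet below assumes networkx and uses hypothetical (paper, authors) records as stand-ins for the DIT publications database.

```python
# Sketch of a coauthorship network, assuming `networkx`; the records are
# hypothetical stand-ins for entries of the DIT publications database.
from itertools import combinations
import networkx as nx

records = [
    ("paper-1", ["A. Huber", "B. Keller"]),
    ("paper-2", ["B. Keller", "C. Meier"]),
    ("paper-3", ["D. Schmid"]),
]

G = nx.Graph()
for paper, authors in records:
    G.add_nodes_from(authors)
    for a, b in combinations(authors, 2):  # one edge per coauthor pair
        G.add_edge(a, b)

# Connected components correspond to collaboration groups; the analysis in the
# paper inspects the homogeneity of topics within each component.
for component in nx.connected_components(G):
    print(sorted(component))
```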