Rare Disease Article K-Means Cluster Analysis
Only about 60% of articles on the rare disease Spinal and Bulbar Muscular Atrophy (SBMA) can be retrieved from PubMed using disease-name keywords. Why? Is it because of inconsistent disease naming? PubMed is known to use article titles to index articles, and efficient article retrieval is critical for rare-disease researchers. Why is the disease name not enough to retrieve articles about SBMA, and how can we improve retrieval? The full code, including the Fuseki application that resulted from the clustering, can be found here.
To answer the question of disease naming over time, a heatmap was created. This led to the realization that although disease naming changed over time, there was nothing discriminating about it. Yes, the abbreviation SBMA became more popular in recent years, but the constituent words (Spinal, Bulbar, Muscular) stayed fairly consistent. Therefore, building the cluster proceeded as outlined below.
Building the Cluster:
Full code and data are in my GitHub
Assuming you have a Python data science environment set up, such as Anaconda, the additional requirements are:
- My TextCleaner app available here.
Part I: Data Preprocessing
Take care of imports:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

#Preprocessing
import nltk
import re
from TextCleaner2000.TextCleaner import TextCleaner

#Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

#For interactive cluster
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import MDS

#Visuals
import matplotlib.pyplot as plt
import mpld3
#Load Data
wian_df = pd.read_csv('DATA.OLD\\SBMA_PYTHON.csv')
#Use my stop-word removal tool to clean text for the tokens
#Set the directory where you have downloaded and unzipped TextCleaner
cleaner = TextCleaner("TextCleaner2000")

#Remove non-alphanumeric characters and replace with blanks; returns lowered words
clean_list = cleaner.alpha_iterator(wian_df["TITLE"])

#Remove common stop-words
clean_list_beta = cleaner.stop_word_iterator(clean_list)
# here I define a tokenizer which returns a sentence as a list of words
def tokenize_only(text):
    # first tokenize by sentence, then by word, so that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
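As a quick sanity check of the filtering step, here is a simplified sketch (a hypothetical title; it mirrors only the letter-filter inside tokenize_only, without nltk's sentence and word tokenizers):

```python
import re

def filter_letter_tokens(tokens):
    # keep only lowered tokens containing at least one letter,
    # mirroring the filter inside tokenize_only above
    return [t.lower() for t in tokens if re.search('[a-zA-Z]', t)]

# hypothetical title: the numeric token "42" and bare punctuation are dropped
toks = filter_letter_tokens("CAG repeat length in 42 patients .".split())
print(toks)  # ['cag', 'repeat', 'length', 'in', 'patients']
```

This is exactly why numeric tokens and stray punctuation never make it into the TF-IDF vocabulary later on.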
#Here, this is where my text cleaner comes in handy: we can pass the custom stop-word list below
remove_words = ('sbma', 'spinal muscular', 'spinal bulbar', 'spinal', 'muscular', 'atrophy', 'bulbar',
                'spinal muscular atrophy', 'spinal bulbar muscular atrophy', "kennedy's", 'kennedy',
                'disease', 'diseases', 'syndrome', 'atrophy', 'kennedy-', 's', 'patient', 'type',
                'bulbospinal', 'case', 'study', 'clinical', 'adult', 'family', 'patients', 'disorders',
                'report', 'bulbo', 'mouse', 'model', 'spinobulbar', 'analysis', 'disorders')
#Remove the custom stop-words. We want to know what is in the other 40% of articles whose titles lack the disease name
titles = cleaner.custom_stop_word_iterator(clean_list_beta, remove_words)
Part II: Defining TFIDF Vector and Prepare for K-Means
We are interested in how articles relate to one another so we can find a pattern. In text, a pattern is represented by related words: words that are "close" to each other in a Euclidean space. But we can't measure text the way we measure numbers, so we need to represent text as numbers. How?
First, we need to find out which words are important. This is where TF-IDF comes in. TF-IDF, or term-frequency inverse-document-frequency, measures the importance of a word through weights that penalize popular words. Once this is done, we can define a similarity function for non-linear dimensionality reduction, find the best number of clusters, tune the hyper-parameters, and train the model. Phew! That's a lot, but hopefully it will make more sense as we take it step by step.
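To see the penalty in action before applying it to the real titles, here is a toy example (a hypothetical three-document corpus with default TfidfVectorizer settings): a word appearing in every document receives a lower weight than a rare one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "receptor" appears in every toy title; "polyglutamine" in only one
docs = ["androgen receptor study",
        "receptor binding study",
        "polyglutamine receptor toxicity"]

vec = TfidfVectorizer()
weights = vec.fit_transform(docs).toarray()
vocab = vec.vocabulary_  # term -> column index

# in doc 2, the ubiquitous "receptor" is down-weighted relative to the rare "polyglutamine"
print(weights[2, vocab["receptor"]], weights[2, vocab["polyglutamine"]])
```

The inverse-document-frequency factor is what drives the gap: both terms occur once in the third document, yet the rare term ends up with the larger weight.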
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.01, stop_words='english',
                                   use_idf=True, ngram_range=(1,3),
                                   tokenizer=tokenize_only)

%time tfidf_matrix = tfidf_vectorizer.fit_transform(titles) #fit the vectorizer to the titles
print(tfidf_matrix.shape)
#We will use these terms to define the words in the clusters
terms = tfidf_vectorizer.get_feature_names()
Here, we reduce the TF-IDF matrix with LSA (truncated SVD) and then re-normalize the result.
# Vectorizer results are normalized, which makes KMeans behave as
# spherical k-means for better results. Since LSA/SVD results are
# not normalized, we have to redo the normalization.
svd = TruncatedSVD(5)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)

X = lsa.fit_transform(tfidf_matrix)
We are using an unsupervised algorithm. There is no data to tell the algorithm: "Hey, this looks good. Let's stop..." However, we are not completely in the dark when it comes to finding the right parameters.
Here, using the "Elbow Method", we choose the number of clusters from the graph below. Distortion is the sum of squared distances between each sample and its closest cluster center, and it always decreases as clusters are added. The point we want is the "elbow": where adding more clusters stops reducing the distortion appreciably. After experimentation, I chose 8.
distortions = []
for i in range(1, 30):
    km = KMeans(n_clusters=i, init='random', n_init=10, max_iter=300, random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)

plt.plot(range(1, 30), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
Part III: K-Means
Now we can fit the k-means++ model to the normalized matrix X.
num_clusters = 8

km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=260, random_state=0, max_iter=700)
%time km.fit(X)

clusters = km.labels_.tolist()
In order to visualize the cluster, we need to compute the dissimilarity between pairs of points in the TF-IDF matrix. That's exactly what the cosine similarity function gives us.
# We need a dissimilarity function to plot the points in the cluster
# We pass its output to the MDS algorithm to get our x and y coordinates for the cluster
dist = 1 - cosine_similarity(tfidf_matrix)
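For intuition on this dissimilarity: cosine similarity depends only on the angle between vectors, so vectors pointing the same way get distance 0 and orthogonal vectors get distance 1. A tiny sketch with hand-made vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# rows 0 and 1 point the same way (distance 0); row 2 is orthogonal (distance 1)
v = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [0.0, 3.0]])
dist = 1 - cosine_similarity(v)
print(np.round(dist, 3))
```

Length-invariance is why cosine distance suits TF-IDF vectors: a long and a short title about the same topic still count as close.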
Here, the MDS algorithm converts the dist output from above into a two-dimensional space.
# use two components as we're plotting points in a two-dimensional plane
# "precomputed" because we provide a distance matrix
# we also specify `random_state` so the plot is reproducible
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1,
          metric=True, eps=1e-5, max_iter=700, n_init=15, n_jobs=-1)

pos = mds.fit_transform(dist)  # shape (n_samples, n_components)
xs, ys = pos[:, 0], pos[:, 1]
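As a sanity check on what fit_transform returns (an (n_samples, n_components) array, one (x, y) pair per sample), here is MDS on a hand-made distance matrix: three points all at distance 1 from one another form an equilateral triangle, which embeds exactly in two dimensions.

```python
import numpy as np
from sklearn.manifold import MDS

# pairwise distances for an equilateral triangle with side 1
D = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(D)

print(pos.shape)  # (3, 2): one (x, y) coordinate pair per sample
print(np.linalg.norm(pos[0] - pos[1]))  # recovered side length, close to 1.0
```

The same idea, scaled up, is how the cosine-distance matrix above becomes the scatter-plot coordinates xs and ys.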
The model is mostly finished. The last few steps are assigning the words to the labels in the cluster legend and making an interactive visualization with the mpld3 library; I'll save these steps for the source. For the full code and data see the git page. Below is the cluster.
What is striking about the cluster is that the titles lacking the various disease names are dominated by terms that describe the disease. Indeed, we see terms like Polyglutamine Protein, CAG Repeat, Androgen Receptor, and Trinucleotide Repeat. At first this doesn't make much sense, until we look up the definition of SBMA. From Wikipedia:
Spinal and bulbar muscular atrophy (SBMA), popularly known as Kennedy's disease, is a progressive debilitating neurodegenerative disorder resulting in muscle cramps and progressive weakness due to degeneration of motor neurons in the brainstem and spinal cord. The condition is associated with mutation of the androgen receptor (AR) gene and is inherited in an X-linked recessive manner...
What is clear from the cluster is that the terms in this rare disease's article titles are the pathophysiological terms (we see CAG Repeat and Androgen Receptor in light purple, for example) that describe the presentation and cause of the disease. So we have answered the first question straight away: the disease names don't even appear in roughly 40% of articles, and since PubMed doesn't index rare diseases to the extent of better-known diseases, we are left searching by disease names alone. The fact that we have shown for this rare disease