Rare Disease Article K-Means Cluster Analysis


Only about 60% of articles about the rare disease Spinal Bulbar Muscular Atrophy (SBMA) can be retrieved from PubMed using the disease name as keywords. Why? Is it because of inconsistent disease naming? PubMed relies heavily on article titles to index articles, and efficient article retrieval is critical for rare-disease researchers. Why is the disease name not enough to retrieve articles about SBMA? How can we improve article retrieval? The full code, including the Fuseki application that resulted from the clustering, can be found here.

See the cluster here.


To answer the question of disease naming over time, a heatmap was created. This led to the realization that although disease naming changed over time, there wasn't anything discriminating about the naming itself. Yes, "SBMA" became more popular in recent years, but the constituent words (Spinal, Bulbar, Muscular) stayed fairly consistent. Therefore, the building of a cluster proceeded as outlined below.
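The heatmap idea above can be sketched as a year-by-term count table. This is my own minimal sketch, not the code from the repo: it assumes a DataFrame like `wian_df` with `TITLE` and `YEAR` columns (the `YEAR` column and the sample rows here are illustrative).

```python
import pandas as pd

# Illustrative stand-in for wian_df; the real data lives in the repo
df = pd.DataFrame({
    "YEAR": [1995, 1995, 2005, 2005, 2015],
    "TITLE": ["kennedy disease case", "bulbar muscular atrophy",
              "spinal bulbar muscular atrophy", "sbma androgen receptor",
              "sbma cag repeat"],
})
naming_terms = ["sbma", "kennedy", "spinal", "bulbar"]

# Count, per year, how many titles mention each naming term
counts = pd.DataFrame(
    {t: df.groupby("YEAR")["TITLE"].apply(lambda s, t=t: s.str.contains(t).sum())
     for t in naming_terms})
print(counts)  # rows = years, columns = naming terms; feed this to seaborn.heatmap
```

A table like `counts` makes it easy to see that no single naming term dominates any era, which is what the heatmap showed.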

Building the Cluster:

Full code and data are in my GitHub repository.

Assuming you have a Python data science environment set up, such as Anaconda, the additional requirements (based on the imports below) are nltk, scikit-learn, mpld3, and TextCleaner2000.

Part I: Data Preprocessing

Take care of imports:

            import numpy as np
            import pandas as pd
            from pandas import DataFrame,Series
            import nltk
            import re
            from TextCleaner2000.TextCleaner import TextCleaner
            #Machine Learning
            from sklearn.feature_extraction.text import TfidfVectorizer
            from sklearn.feature_extraction.text import TfidfTransformer
            from sklearn.pipeline import make_pipeline
            from sklearn.preprocessing import Normalizer
            from sklearn.decomposition import TruncatedSVD
            from sklearn.cluster import KMeans
            #For interactive cluster
            from sklearn.metrics.pairwise import cosine_similarity
            from sklearn.manifold import MDS
            import matplotlib.pyplot as plt
            import mpld3

            #Load Data
            #(the CSV of article titles is in the repo; this filename is illustrative)
            wian_df = pd.read_csv("sbma_titles.csv")

            #Use my stop word removal tool to clean text for the tokens
            #Set the directory where you have downloaded and unzipped TextCleaner2000
            cleaner = TextCleaner("TextCleaner2000")
            #Remove non-alphanumeric characters and replace with blank; returns lowered words
            clean_list = cleaner.alpha_iterator(wian_df["TITLE"])

            #Remove common stop-words
            clean_list_beta= cleaner.stop_word_iterator(clean_list)
            # here I define a tokenizer which returns a sentence as a list of words
            def tokenize_only(text):
                # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
                tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
                filtered_tokens = []
                # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
                for token in tokens:
                    if re.search('[a-zA-Z]', token):
                        filtered_tokens.append(token)
                return filtered_tokens
            #Here, this is where my text cleaner comes in handy: we can pass this custom stop-word list below
            remove_words = ('sbma', 'spinal muscular',
                    'spinal bulbar', 'spinal', 'muscular','atrophy', 'bulbar','spinal muscular atrophy', 
                    'spinal bulbar muscular atrophy', "kennedy's", 'kennedy', 'disease', 'diseases', 'syndrome',
                    'atrophy', 'kennedy-', 's', 'patient','type','bulbospinal','case','study','clinical','adult','family','patients',
                    'disorders', 'report','bulbo','mouse','model','spinobulbar','analysis','disorders')
            #Remove the custom stop-words. We want to know what is in the other 40% of articles, which is not disease name
            titles = cleaner.custom_stop_word_iterator(clean_list_beta, remove_words)   

Part II: Defining the TF-IDF Vector and Preparing for K-Means

We are interested in finding how articles are related to one another in order to find a pattern. In text, the pattern is represented by related words: words that are "close" to each other in a Euclidean space. But we can't measure text the way we measure numbers, so we need to represent text as numbers. How?

First, we need to find out which words are important. This is where TF-IDF comes in. TF-IDF, or term-frequency inverse-document-frequency, measures the importance of a word by down-weighting words that appear in many documents. Once this is done, we can define a similarity function for non-linear dimensionality reduction, find the best number of clusters, tune the hyper-parameters, and train the model. Phew! That's a lot, but hopefully it will make more sense as we take it step by step.

            from sklearn.feature_extraction.text import TfidfVectorizer

            #define vectorizer parameters
            tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                                 min_df=0.01, stop_words='english',
                                                 use_idf=True, ngram_range=(1,3), tokenizer=tokenize_only)
            %time tfidf_matrix = tfidf_vectorizer.fit_transform(titles) #fit the vectorizer to the titles
            #We will use these terms to define the words in the clusters
            #(on scikit-learn >= 1.0, use get_feature_names_out() instead)
            terms = tfidf_vectorizer.get_feature_names()

Here, we chain two popular steps: TruncatedSVD reduces the TF-IDF matrix to five latent components (this is LSA), and a Normalizer rescales each row to unit length.

            # Vectorizer results are normalized, which makes KMeans behave as
            # spherical k-means for better results. Since LSA/SVD results are
            # not normalized, we have to redo the normalization.
            svd = TruncatedSVD(5)
            normalizer = Normalizer(copy=False)
            lsa = make_pipeline(svd, normalizer)
            X = lsa.fit_transform(tfidf_matrix)
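A sanity check I find useful at this step (my addition, not in the original post): `TruncatedSVD` exposes how much of the TF-IDF variance the retained components capture, which helps judge whether 5 components is enough. The corpus below is a stand-in.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Stand-in corpus; in the post, tfidf_matrix comes from the article titles
docs = ["androgen receptor toxicity", "polyglutamine tract expansion",
        "cag repeat length", "androgen receptor cag repeat",
        "motor neuron degeneration", "trinucleotide repeat disorder"]
tfidf = TfidfVectorizer().fit_transform(docs)

svd = TruncatedSVD(5, random_state=0)
lsa = make_pipeline(svd, Normalizer(copy=False))
X = lsa.fit_transform(tfidf)

print(X.shape)                               # (n_samples, 5)
print(svd.explained_variance_ratio_.sum())   # fraction of variance the 5 components keep
```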

We are using an unsupervised algorithm. There are no labels to tell the algorithm: "Hey, this looks good. Let's stop..." However, we are not completely in the dark when it comes to finding the right parameters.

Here, using the "Elbow Method", we choose the number of clusters from the graph below. Distortion is the sum of squared distances between each sample and its closest cluster center; it always decreases as clusters are added, so the point we want is the "elbow," where adding more clusters stops reducing the distortion dramatically. After experimentation, I chose 8.

            distortions = []
            for i in range(1, 30):
                km = KMeans(n_clusters=i, init='k-means++', random_state=0)
                km.fit(X)
                distortions.append(km.inertia_)
            plt.plot(range(1, 30), distortions, marker='o')
            plt.xlabel('Number of clusters')
            plt.ylabel('Distortion')
            plt.show()


Part III: K-Means

Now we can fit the k-means model, with k-means++ initialization, to the normalized matrix X.

            num_clusters = 8
            km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=260,random_state=0, max_iter=700)
            %time km.fit(X)
            clusters = km.labels_.tolist()

In order to visualize the cluster, we need to compute the dissimilarity between points within the TF-IDF matrix. Cosine similarity measures how alike two documents are, so subtracting it from one gives us a distance.

            # We need a dissimilarity function to plot the points in the cluster
            # We pass this function to the MDS algorithm and we get our x and y coordinates for the cluster 
            dist = 1 - cosine_similarity(tfidf_matrix)

Here, the MDS algorithm converts the dist output from above into a two-dimensional space.

            # convert two components as we're plotting points in a two-dimensional plane
            # "precomputed" because we provide a distance matrix
            # we will also specify `random_state` so the plot is reproducible.
            mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1, metric=True, eps=1e-5, 
                        max_iter = 700, n_init=15, n_jobs = -1)
            pos = mds.fit_transform(dist)  # shape (n_samples, n_components)
            xs, ys = pos[:, 0], pos[:, 1]
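The basic scatter step can be sketched as below; this is my minimal stand-in (the post defers the real styling and the mpld3 interactivity to the full source), with hard-coded coordinates and labels in place of the MDS output and k-means labels.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Stand-in coordinates and labels; in the post these are xs, ys and clusters
xs = np.array([0.1, -0.3, 0.5, -0.2])
ys = np.array([0.4, 0.1, -0.6, 0.2])
clusters = [0, 1, 0, 1]

fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(xs, ys, c=clusters, cmap="tab10", s=40)
ax.legend(*scatter.legend_elements(), title="cluster")
ax.set_title("SBMA article titles, k-means clusters (MDS projection)")
fig.savefig("cluster.png")
```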

The model is mostly finished. The last few steps are assigning the top words to the cluster labels in the legend and making an interactive visualization with the mpld3 library; I'll save these steps for the source. For the full code and data, see the git page. Below is the cluster.


What is striking about the cluster is that the titles lacking the various disease names are dominated by terms that describe the disease. Indeed, we see terms like Polyglutamine Protein, CAG Repeat, Androgen Receptor, and Trinucleotide Repeat. At first this doesn't make much sense, until we look up the definition of SBMA. From Wikipedia:

Spinal and bulbar muscular atrophy (SBMA), popularly known as Kennedy's disease, is a progressive debilitating neurodegenerative disorder resulting in muscle cramps and progressive weakness due to degeneration of motor neurons in the brainstem and spinal cord.[2] The condition is associated with mutation of the androgen receptor (AR) gene[4][5] and is inherited in an X-linked recessive manner...

What is clear from the cluster is that the terms in this rare disease's article titles are pathophysiological terms (we see CAG Repeat and Androgen Receptor in light purple, for example) that describe the presentation and cause of the disease. So we have answered the first question straight away: rare disease names don't even appear in roughly 40% of articles, and since PubMed doesn't index rare diseases to the extent of better-known diseases, we are left searching by disease names alone. The fact that, for this rare disease, titles lean on pathophysiological terms suggests that retrieval could be improved by searching on those terms alongside the disease names.