Customer Churn ANN


Abstract

The data, from Kaggle, contains 10,000 samples of a German bank's customers, some of whom have closed their account. An artificial neural network was built to predict whether a customer will leave. The final model's accuracy is around 85%. The model could be applied to current customers to "tag" those who might be at risk of leaving, so they can be targeted with a campaign to prevent churn. The code for this project can be found here.

Introduction

The old business axiom of "it costs 7 times as much to get a new customer as it does to keep an existing one" is really what drives this project. If we can predict when a customer is going to leave, maybe we can take action to keep them. In this scenario, we have a bank that has account information for 10,000 customers, some of which have left the bank. Can we use the account details: CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard (Does the customer have a credit card?), IsActiveMember and EstimatedSalary to predict whether a customer will leave?


Methodology

The basic steps followed in this project:

  • I: Preprocess the data
  • II: Build initial model
  • III: Find the best model

Full code and data in my github

Assuming you have a python data science environment up, like Anaconda, the additional requirements are:

  • tensorflow
  • keras

Part I: Data Preprocessing

Take care of imports:

            # Importing the libraries
            import numpy as np
            import matplotlib.pyplot as plt
            import pandas as pd
                
            #Data Preprocessing
            from sklearn.preprocessing import LabelEncoder, OneHotEncoder
            from sklearn.preprocessing import StandardScaler
            from sklearn.model_selection import train_test_split
                
            #For the Neural Network
            from keras.models import Sequential
            from keras.layers import Dense
                
                
            #Testing
            from sklearn.metrics import confusion_matrix
            from keras.wrappers.scikit_learn import KerasClassifier
            from sklearn.model_selection import cross_val_score
            from sklearn.model_selection import GridSearchCV
       
        
        
    

Load our data. We take only the columns we specified in the introduction.

       
    
            # Importing the dataset
            dataset = pd.read_csv('DATA.IN\\Churn_Modelling.csv')
            X = dataset.iloc[:, 3:13].values
            y = dataset.iloc[:, 13].values
        
    

We need to encode the categorical and ordinal features.

       
        
            # Encode Geography (column 1) and Gender (column 2) as integers
            labelencoder_X1 = LabelEncoder()
            X[:, 1] = labelencoder_X1.fit_transform(X[:, 1])

            labelencoder_X2 = LabelEncoder()
            X[:, 2] = labelencoder_X2.fit_transform(X[:, 2])

            # One-hot encode the Geography column
            onehotencoder = OneHotEncoder(categorical_features = [1])
            X = onehotencoder.fit_transform(X).toarray()

            # Drop one of the encoded country columns to avoid the dummy variable trap
            X = X[:, 1:]
        
        

Develop training and testing datasets for the initial run and perform feature scaling, which is a must for neural networks. Later, we will produce a final model with k-fold cross-validation.

        
            # Splitting the dataset into the Training set and Test set

            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
                        
            # Feature scaling (a must for neural networks)
                        
            sc = StandardScaler()
            X_train = sc.fit_transform(X_train) 
            X_test = sc.transform(X_test)
                        
        
            

Part II: Build Initial Model

In the hidden layers we use the relu activation function, which determines each neuron's output. In the output layer we use the sigmoid function, which returns a probability between 0 and 1 that we can threshold into a 0/1 prediction. We need to specify the number of nodes in the hidden layers. A common trick is to take the number of independent variables plus the number of dependent variables and divide by two. In our case we have 11 independent variables and one dependent variable, which gives 12 / 2 = 6 nodes. Our first layer also needs the input_dim passed.
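The node-count arithmetic above can be expressed as a tiny helper (the function name hidden_units is mine, purely for illustration):

```python
def hidden_units(n_independent, n_dependent):
    # Rule of thumb from the text: average the input and output widths
    return (n_independent + n_dependent) // 2

print(hidden_units(11, 1))  # 6
```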

       
        
            # In keras, we can build a network one layer at a time.
            #Initialize classifier
            classifier = Sequential()

            classifier.add(Dense(units=6, kernel_initializer='uniform', 
                     activation='relu', input_dim=11))
            
        
        

We don't have to add more layers (only input and output layers are required), but here we will.

            
            classifier.add(Dense(units=6, kernel_initializer='uniform', 
                    activation='relu'))
                    
            
        

Output layer with the sigmoid specified:

            
            classifier.add(Dense(units=1, kernel_initializer='uniform', 
                    activation='sigmoid'))
           
            
        

Now we can compile and fit the model. Compilation sets all of our options and fit fits the data to the model. We choose the appropriate loss to match our output function.

            
            classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics= ['accuracy'])
            classifier.fit(X_train,y_train, batch_size=256, epochs=100)
                            
            

        

After running the model we can get the confusion matrix, which we use to see how a classifier performs.

            
            y_pred = classifier.predict(X_test)

            #We need to convert probabilities into true or false values, with a threshold
                    
            y_pred = (y_pred > 0.5)
                    
            # Making the Confusion Matrix
                    
            cm = confusion_matrix(y_test, y_pred)
            
        
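A minimal sketch of how accuracy is read off such a matrix, using made-up labels rather than this model's actual predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels; the real values come from y_test and y_pred above
y_true = np.array([0, 0, 1, 1, 0, 1])
y_hat  = np.array([0, 1, 1, 0, 0, 1])

cm = confusion_matrix(y_true, y_hat)
# Layout: [[TN, FP],
#          [FN, TP]]
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
print(cm)
print(accuracy)
```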


Part III: k-fold cross-validation and GridSearch Hyperparameter Selection

We can use KerasClassifier, which wraps a Keras model in sklearn's estimator interface. We need this for the grid search algorithm, or if we were just doing k-fold cross-validation. In order to use KerasClassifier, we need to pass a function that builds our required network.

            
            def classifier_builder(optimizer):
    
                classifier = Sequential()
                
                classifier.add(Dense(units=6, kernel_initializer='uniform', 
                                    activation='relu', input_dim=11))
                    
                    
                classifier.add(Dense(units=6, kernel_initializer='uniform', 
                                    activation='relu'))
                    
                    
                classifier.add(Dense(units=1, kernel_initializer='uniform', 
                                    activation='sigmoid'))
                    
                    
                classifier.compile(optimizer = optimizer, loss = 'binary_crossentropy', metrics= ['accuracy'])
                    
                return classifier
                
            
        

Now we can initialize a KerasClassifier instance. Then we can define the parameters to pass to GridSearchCV.

            
                classifier = KerasClassifier(build_fn = classifier_builder)
                parameters = {'batch_size': [128,256,512,1048], 'epochs' : [10,20,40,50,100,300],
               'optimizer': ['adam','rmsprop']}
            
        

Finally, we can use GridSearchCV to fit the model. It will output the best parameters and their cross-validation scores for each parameter set we passed.

            
                grid_search = GridSearchCV(estimator = classifier, param_grid = parameters,
                                           scoring = 'accuracy', cv = 10)
                grid_search = grid_search.fit(X_train, y_train)
            
        
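Before launching a search this size, it helps to count the fits. A quick sketch, assuming the parameter grid above and 10-fold cross-validation:

```python
from itertools import product

parameters = {'batch_size': [128, 256, 512, 1048],
              'epochs': [10, 20, 40, 50, 100, 300],
              'optimizer': ['adam', 'rmsprop']}

# Every combination of the three parameter lists
n_combinations = len(list(product(*parameters.values())))
n_fits = n_combinations * 10  # 10 folds per combination
print(n_combinations, n_fits)  # 48 480
```

After fitting, grid_search.best_params_ and grid_search.best_score_ hold the winning combination and its mean cross-validation accuracy.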

After about 9 hours with tensorflow-gpu we get our best parameters: a batch size of 256, 300 epochs, and the rmsprop optimizer.



Now we can build the final model and get a prediction.

            
            classifier = Sequential()

            #Add input layer
            classifier.add(Dense(units=6, kernel_initializer='uniform', 
                                    activation='relu', input_dim=11))
                                       
            classifier.add(Dense(units=6, kernel_initializer='uniform', 
                                    activation='relu'))
                    
            # Add output layer
            classifier.add(Dense(units=1, kernel_initializer='uniform', 
                                    activation='sigmoid'))
                    
            classifier.compile(optimizer = 'rmsprop', loss = 'binary_crossentropy', metrics= ['accuracy'])
                                  
            #Fitting ANN to full data set with best parameters.
            #train_test_split shuffled the rows, so the stacked features
            #must be paired with labels in the same order, not the original y.
            X_full = np.vstack([X_train, X_test])
            y_full = np.concatenate([y_train, y_test])
            classifier.fit(X_full, y_full, batch_size=256, epochs=300)
                    
            # Use transform, not fit_transform: the scaler must reuse the training statistics
            new_prediction = classifier.predict(sc.transform(np.array([[0,0,600,1,40,3,65000,2,1,1,45000]])))
            new_prediction = (new_prediction > .5)
                    
            new_prediction
            
        

It looks like this customer is not at risk of leaving...


Outcome:

It appears that we have built a pretty good predictor of customer churn. The model can immediately be applied to this bank's data to pinpoint customers who might be at risk of leaving, so measures can be taken to keep them.


For a basic overview of what an artificial neural network is, see my blog post What is an Artificial Neural Network?




Part I: Data Preprocessing

In this step we have basically two goals: we need to encode the ordinal and categorical features, and we need to standardize the numeric data. We can use sklearn's LabelEncoder to encode class and ordinal features. Categorical features are things like color, ordinal features are things like size, and class features are things like flower name. Categorical data also needs to be one-hot encoded, since the integer codes LabelEncoder assigns imply an ordering that is meaningless for categories. We can use sklearn's OneHotEncoder to do this after we apply the LabelEncoder.
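A small sketch of the two-step encoding on toy data (the one-hot step here uses a NumPy identity-matrix trick purely for illustration, rather than the OneHotEncoder class used in the real code):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy geography column like the one in this dataset
geo = np.array(['France', 'Spain', 'Germany', 'France'])

le = LabelEncoder()
codes = le.fit_transform(geo)  # classes are sorted: France=0, Germany=1, Spain=2
print(codes)  # [0 2 1 0]

# One-hot encode so the integer codes carry no artificial ordering
one_hot = np.eye(len(le.classes_))[codes]
print(one_hot[0])  # France -> [1. 0. 0.]
```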

Now that we have all of the features as numerical types, we can standardize the data. For neural networks this is a must-do. To standardize the data we can use sklearn's StandardScaler class, which gives each feature a mean of 0 and a standard deviation of 1.
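A quick sketch with made-up CreditScore and Age values confirms what StandardScaler does:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy CreditScore / Age columns (made-up values)
X = np.array([[600., 40.],
              [850., 25.],
              [700., 55.]])

sc = StandardScaler()
X_std = sc.fit_transform(X)

print(X_std.mean(axis=0))  # ~[0. 0.]
print(X_std.std(axis=0))   # ~[1. 1.]
```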

After encoding, we drop one of the one-hot country columns, since its information is redundant (the dummy variable trap). We can then separate the data into a training set and a testing set; usually 70-80% of the data is used for training, with the remainder held out for testing (here, an 80/20 split). This is how we can easily judge the performance of our model, in addition to the k-fold cross-validation and parameter tuning discussed in Part III.
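The split itself is one call; a toy sketch with 50 dummy rows standing in for the 10,000 customers:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 samples, 2 features
y = np.arange(50) % 2              # dummy 0/1 churn labels

# 80/20 split, as in this project; rows and labels stay paired
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))  # 40 10
```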

Part II: Build initial model

I always build an initial model just to see how well an algorithm might work on a particular problem and data set. This accomplishes two things. First, we see what sort of performance to expect from the algorithm. Second, we establish a baseline from which we can depart. In other words, when we do hyperparameter tuning we want an idea of how much better the model becomes as a result of tuning.

In this case, we built a small network with two hidden layers and mostly default parameters, specifying 100 for epochs (the number of passes over the training data) and 256 for batch size (how many samples are processed per weight update). The back-end of Keras is Tensorflow, which has GPU-based execution that is very fast. However, on some computers a small batch size can result in very slow run times. Thus, in training this model and in later hyperparameter tuning, I chose batch sizes of at least 128.
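Batch size matters because it sets how many weight updates happen per epoch. With roughly 8,000 training rows (80% of 10,000), the counts work out as:

```python
import math

n_train = 8000  # 80% of the 10,000 customers end up in the training set

for batch_size in (128, 256, 512):
    # One weight update per batch; the last, partial batch still counts
    updates_per_epoch = math.ceil(n_train / batch_size)
    print(batch_size, updates_per_epoch)
```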

We get some pretty good results:



Above is the confusion matrix, which shows where the classification algorithm "confused" outcomes: predicting class A when the truth was class B, and vice versa. The top left holds the true negatives, next to the false positives; the bottom left holds the false negatives, and the bottom right the true positives. When the class labels appear in roughly comparable numbers, accuracy can be defined from the confusion matrix as (TP + TN) / num_samples, where TP = true positives and TN = true negatives. In our case we have: (1548 + 136) / 2000 = 0.842.

We can also see from the confusion matrix that the algorithm misclassifies some customers as not being churn risks, when they are. This is something we would like to improve if we can.

Part III: Find the best model

If we are using a supervised algorithm and we have data amenable to supervised algorithms, we can use a method called hyperparameter tuning to find the best combination of model parameters. With this method, we are able to try lists of parameters and when the algorithm is done, we get the set of best parameters that yield the highest predictive ability.

How do we know a model works?

For classification, we use model accuracy as defined in the last section to define the characteristics of the "best" set of parameters and thus the best model.

We still have a couple of problems, though. If we use just a single training/testing split, our accuracy estimate is biased by the random selection of samples in each set. Furthermore, how can we find the best accuracy if we don't know the best parameters for the algorithm?

Luckily, sklearn has a solution. Sklearn not only has a k-fold cross-validation function, it also has a GridSearchCV class with cross-validation built in! The grid search functionality tests various combinations of parameters and records the best ones.

What happens in GridSearchCV is we pass in an instance of the model we are training, the data and all the parameters we want to test. The GridSearchCV class instance then takes the data and "folds" it k-times, usually ten. So we are actually testing each set of parameters ten different times. We then get the best parameters when the algorithm is finished. Brilliant!
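The folding itself is easy to see with sklearn's KFold class on toy data: each sample lands in the held-out fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 toy samples
test_counts = np.zeros(10, dtype=int)

kf = KFold(n_splits=5)
for train_idx, test_idx in kf.split(X):
    test_counts[test_idx] += 1     # tally how often each sample is held out

print(test_counts)  # [1 1 1 1 1 1 1 1 1 1]
```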

We know the model that results from the parameters GridSearchCV finds is trustworthy because it has been tested ten times for each set of parameters. For many parameter grids, this means GridSearchCV fits the model hundreds or even thousands of times to find the best one.