Balaram Panigrahy: Telecom Churn - Exploratory Data Analysis

Abstract:

The telecoms industry is a highly competitive sector which is constantly challenged by customer churn or attrition. In order to remain steadfast in the consumer business, companies need to have sophisticated churn management strategies that will harness valuable data for business intelligence. Data mining and machine learning are tools which can be used by telecoms companies to monitor the churn behaviour of customers.

This study applies exploratory data analysis and feature engineering to a public-domain telecom dataset and discusses how the results can help reduce customer churn and improve customer service.

Introduction:

Customer churn is a good indicator of service quality and customer service satisfaction. The telecommunications industry is a dynamic business sector that is primarily composed of companies operating in a subscription-based model. These companies are under constant pressure from customers who churn and move to rival companies that offer competitive products and services. Thus, some of them take measures to determine why their customers churn and seek innovative strategies to improve customer satisfaction and grow the customer base.

Customer Relationship Management (CRM) is a strategic process of managing customer relations and customer retention. Some companies mine customers’ data to better understand the behavior of their customers and gain actionable insights that help improve customer service. When machine learning is embedded in CRM software, it can track churn rates, identify churn determinants, and pinpoint customers who are at risk of churning. It can also help a company decide on and employ proactive retention strategies.

Types of Churners:

From the problem statement, we identify the following types of churners.

Voluntary Churner - Voluntary churn occurs when a customer makes a conscious decision to cancel your services. This could be due to a variety of reasons, such as dissatisfaction with the service, unhappiness with the pricing, or the perceived value of your service. If your clients are cancelling in droves or you receive multiple customer complaints, it could be a sign of an underlying problem. When this kind of churn happens, it is essential to find the root cause and update your strategy, as voluntary churn drives up the customer acquisition cost. Voluntary churn is again of two types:

a) Deliberate Churn – Deliberate churn happens when a customer decides to leave. This could be because they find a better deal, product or service with a competitor or because they are unsatisfied with what your business offers. You might hear them saying

b) Incidental Churn – Incidental churn is when a customer is no longer able to remain with you. For example, they move somewhere you do not serve, or they no longer have the funds to keep purchasing what you sell. These reasons can be hard, but not impossible, for a business to overcome.

Involuntary churn -

When most companies create strategies to combat churn, they mostly consider voluntary churn as it’s the one that’s easier to notice. But there is another type of churn that may drain a huge amount of revenue and can become an issue if it goes unchecked.

Involuntary churn occurs when a customer’s payment attempt fails without the customer noticing. If the customer misses multiple payments, the subscription is not renewed, resulting in churn.

Reasons for customer churning:

· Price: Pricing promotions abound to entice customers to flee from one carrier to a competitor

· Service quality: Lack of connection capabilities may make a customer go with a carrier with wider network coverage

· Lack of customer service: Slow or no response to customer complaints makes a customer more likely to switch carriers

· Billing disputes

· New competitors entering the market

· Competitors introducing new products or technology

Business Problem Overview:

Using the data provided, this research aims to determine which variables are correlated with customer churn.
The telecom industry experiences an average annual churn rate of 15-25%, and it costs 5-10 times more to acquire a new customer than to retain an existing one, so customer retention has become even more important than customer acquisition.
To reduce the churn rate, telecom companies need to predict which customers are at high risk of churn.

In this project, we analyze customer-level data from a leading telecom company to predict customer churn.

Data description:

Data volume:

Since we did not know in advance which features would be useful for predicting churn, we worked with all the data that reflect customer behavior in general. We used data on calls and charges, including total day, evening, night, and international calls and their charges, among other fields. The dataset has 3,333 rows and 20 columns, that is, a shape of (3333, 20).

Unbalanced dataset:

The dataset is unbalanced, a special case of the classification problem in which the classes are not evenly represented. The dominant class is called the basic class, and the other is called the secondary class. A data set is considered unbalanced if one of its categories is 10% or less compared to the other one.

We found that the Orange S.A. Telecom dataset is unbalanced, since the secondary class, which represents churned customers, makes up about 14% of the whole dataset.
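The class balance can be checked directly from the target column. The following is a minimal sketch, assuming the data sits in a pandas DataFrame with a boolean `Churn` column (the column name is an assumption). The small frame is synthetic and only mimics the roughly 14% churn share reported above.

```python
import pandas as pd

# Synthetic stand-in for the real data: 100 customers, 14 of whom churned.
# The column name "Churn" is an assumption about the dataset's schema.
df = pd.DataFrame({"Churn": [True] * 14 + [False] * 86})

# Share of each class in the target column.
class_share = df["Churn"].value_counts(normalize=True)

# The smaller of the two shares belongs to the secondary (minority) class.
minority_share = class_share.min()

print(class_share)
print(f"Minority class share: {minority_share:.0%}")
```

On the real dataset, the same two lines would be applied to the loaded DataFrame instead of the synthetic one.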

Steps Involved:

Ø Exploratory Data Analysis

After data collection, several steps are carried out to explore the data. The goal of this step is to understand the data structure, conduct initial pre-processing, clean the data, identify patterns and inconsistencies in the data (i.e., skewness, outliers, missing values), and build and validate hypotheses. This process helped us figure out various aspects of the data and the relationships between the target and the independent variables.

It gave us a better idea of which feature behaves in which manner compared to the target variable.

Ø Missing Values:

There is a representation of each service and product for each customer. Missing values may occur because not all customers have the same subscription. Some of them may have a number of services and others may have something different. In addition, there are some columns related to system configurations and these columns may have null values.

However, the Orange S.A. Telecom data set contains no null values.
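A null check of this kind can be sketched with pandas as follows. The frame and its column names are synthetic assumptions, and one value is deliberately left missing so that the check has something to report. On a data set without nulls, every count would be zero.

```python
import numpy as np
import pandas as pd

# Synthetic frame with hypothetical column names; one value is deliberately
# missing so that the check has something to report.
df = pd.DataFrame({
    "Total day minutes": [180.0, 210.5, np.nan, 150.2],
    "Total intl calls": [3, 4, 5, 7],
    "Churn": [False, False, True, False],
})

# Number of null entries in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# True only when the whole frame is free of nulls.
has_no_nulls = not df.isnull().values.any()
print("No missing values:", has_no_nulls)
```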

Ø Univariate Analysis:

Univariate analysis looks at one feature at a time. When we analyse a feature independently, we are usually most interested in the distribution of its values and ignore the other features in the dataset.

Below, we will consider different statistical types of features and the corresponding tools for their individual visual analysis.

In our case, the data is not balanced; that is, our two target classes, loyal and disloyal customers, are not represented equally in the dataset. Only a small part of the clients cancelled their subscription to the telecom service. As we will see in the following articles, this fact may imply some restrictions on measuring the classification performance, and, in the future, we may want to additionally penalize our model errors in predicting the minority "Churn" class.

Bar plots

Histograms are best suited for looking at the distribution of numerical variables, while bar plots are used for categorical features.

Histograms and density plots

In the above plot, we see that the variable Total day minutes is normally distributed, while Total intl calls is prominently skewed right (its tail is longer on the right).

A histogram groups values into bins of equal value range. The shape of the histogram may contain clues about the underlying distribution type: Gaussian, exponential, etc. You can also spot any skewness in its shape when the distribution is nearly regular but has some anomalies. Knowing the distribution of the feature values becomes important when you use Machine Learning methods that assume a particular type.
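The binning and the skewness check described here can be sketched numerically with NumPy. The sample below is synthetic (a Poisson draw standing in for a count-like feature) and is not taken from the dataset. Skewness is computed as the third standardized moment, which is one common definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, right-skewed sample standing in for a count-like feature
# such as "Total intl calls" (values here are not from the real dataset).
values = rng.poisson(lam=4, size=1000)

# A histogram splits the value range into bins of equal width and counts
# how many observations fall into each bin.
counts, bin_edges = np.histogram(values, bins=10)
bin_widths = np.diff(bin_edges)

# Sample skewness (third standardized moment): positive means a longer
# right tail, values near zero suggest a roughly symmetric shape.
centered = values - values.mean()
skewness = (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

print(counts)
print(f"Skewness: {skewness:.2f}")
```

A clearly positive skewness matches the "tail is longer on the right" shape described for Total intl calls.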

Box plot

Another useful type of visualization is a box plot.

We can see that a large number of international calls is quite rare in our data.

The box by itself illustrates the interquartile spread of the distribution; its length is determined by the 25th (Q1) and 75th (Q3) percentiles. The vertical line inside the box marks the median (50%) of the distribution.

The whiskers are the lines extending from the box. They represent the entire scatter of data points, specifically the points that fall within the interval

(Q1 − 1.5⋅IQR, Q3 + 1.5⋅IQR), where IQR = Q3 − Q1 is the interquartile range.
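The whisker interval can be computed by hand to see which points a box plot would flag as outliers. This is a minimal sketch on a small synthetic sample. It uses NumPy's default (linear interpolation) percentile method, and other tools may compute quartiles slightly differently.

```python
import numpy as np

# Small synthetic sample; the last value is deliberately extreme.
values = np.array([1, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 18])

# Quartiles and the interquartile range.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Limits of the whisker interval (Q1 - 1.5*IQR, Q3 + 1.5*IQR).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the interval are drawn individually as outliers.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(outliers)
```

In matplotlib and seaborn, each whisker is drawn to the most extreme data point that still lies inside this interval, so the whiskers are often shorter than the interval itself.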

Ø Multivariate Analysis:

Multivariate plots allow us to see relationships between two or more different variables, all in one figure. Just as in the case of univariate plots, the specific type of visualization will depend on the types of the variables being analyzed.

Correlation matrix

Let's look at the correlations among the numerical variables in our dataset. This information is important to know as there are Machine Learning algorithms (for example, linear and logistic regression) that do not handle highly correlated input variables well.

First, we use the corr() method of a DataFrame, which calculates the correlation between each pair of features. Then, we pass the resulting correlation matrix to heatmap() from seaborn, which renders a color-coded matrix of the provided values.
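The two steps can be sketched as follows. The data is synthetic and the column names are assumptions. The charge column is built as an arbitrary fixed rate times the minutes column, so that one pair of features is perfectly correlated by construction. The plotting call is wrapped in a hypothetical helper, `plot_heatmap`, so that the numeric part runs without a display.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the numerical features; column names are
# assumptions. Charge is an arbitrary fixed rate times minutes, so these
# two columns are perfectly correlated by construction.
minutes = rng.normal(loc=180, scale=50, size=500)
df = pd.DataFrame({
    "Total day minutes": minutes,
    "Total day charge": 0.17 * minutes,
    "Total night minutes": rng.normal(loc=200, scale=50, size=500),
})

# Pairwise Pearson correlation between all numerical columns.
corr_matrix = df.corr()
print(corr_matrix.round(2))


def plot_heatmap(matrix):
    """Render a correlation matrix as a colour-coded heatmap."""
    # Imported here so the numeric part above runs without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(matrix, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()


# plot_heatmap(corr_matrix)  # uncomment to draw the figure
```

Pairs with a correlation close to 1 or -1 carry nearly the same information, so one feature of each such pair is a candidate for removal before fitting linear or logistic regression.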

Scatter plot

The scatter plot displays values of two numerical variables as Cartesian coordinates in 2D space. Scatter plots in 3D are also possible.

We get an uninteresting picture of two normally distributed variables. Also, it seems that these features are uncorrelated because the ellipse-like shape is aligned with the axes.
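The visual impression of "uncorrelated" can be backed by a number. The sketch below draws two independent, normally distributed synthetic features (names and parameters are illustrative, not taken from the dataset) and computes their Pearson correlation coefficient. The scatter plot itself is wrapped in a hypothetical helper, `plot_scatter`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two independent, normally distributed synthetic features (names and
# parameters are illustrative, not taken from the real dataset).
day_minutes = rng.normal(loc=180, scale=50, size=3000)
night_minutes = rng.normal(loc=200, scale=50, size=3000)

# Pearson correlation coefficient; close to zero for unrelated features.
pearson_r = np.corrcoef(day_minutes, night_minutes)[0, 1]
print(f"Pearson r: {pearson_r:.3f}")


def plot_scatter(x, y):
    """Draw a 2D scatter plot of the two features."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    plt.scatter(x, y, s=5, alpha=0.5)
    plt.xlabel("Total day minutes")
    plt.ylabel("Total night minutes")
    plt.show()


# plot_scatter(day_minutes, night_minutes)  # uncomment to draw the figure
```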

Balaram Panigrahy

Monday, September 27, 2021
