Targeted Customer Evaluation

1. Identifying data quality issues and how they may impact our analysis going forward

Since the company has not defined its own business rules for data quality evaluation, I will follow the general process.

To evaluate the quality of the data, we can follow the six standard data quality dimensions:

  • Accuracy: The degree to which the data correctly describe real-world objects. Example: a man is 30 years old, but the data record him as 35.
  • Completeness: Whether the data fulfill the expectation of business comprehensiveness; in other words, whether you can extract the information you want from the data.
  • Uniqueness: There should be no duplicate records in the dataset. Example: there are 300 students in total, but the dataset contains 350 records.
  • Timeliness: The data should be recorded at the time the event occurred, with no delay.
  • Validity: Data are valid if they conform to the syntax (format, type, range) of their definition.
  • Consistency: The same entity should be represented identically across datasets. Example: if a student named 'Peter Pan' appears in the class roster, the record 'Peter Pan' in the school roster should match it.

Reference: Defining Data Quality Dimensions
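
As a quick illustration, here is a minimal sketch of how a few of these dimensions can be checked with pandas; the dataframe and its columns are hypothetical, not taken from the actual dataset:

import pandas as pd

# Hypothetical student roster used only to illustrate the checks
df = pd.DataFrame({
    'student_id': [1, 2, 2, 3],
    'name': ['Peter Pan', 'Alice', 'Alice', None],
    'age': [12, 13, 13, 150],
})

print(df.isnull().sum())      # completeness: missing values per column
print(df.duplicated().sum())  # uniqueness: count of duplicate records
print(df[(df['age'] < 5) | (df['age'] > 100)])  # validity: implausible ages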

Problems, Procedures, and Notes (Excel and Python)

Load the file in Python and check the sheets in the file:
# Load the necessary libraries
import pandas as pd
import xlrd
import xlwt
import datetime as dt
from datetime import date
import seaborn as sns
import matplotlib.pyplot as plt

# Read the file and check the sheet names
data = pd.ExcelFile('C:/Users/Administrator/Desktop/raw_data_update_final.xlsx')
names = data.sheet_names
names

# Check the null values in each sheet to determine whether they will affect our analysis
for name in names:
    null = data.parse(name).isnull().sum()
    print(null)
  1. After looking through the summary of null values, I found that the missing values have little impact on our analysis: since we want to focus on the customers with relatively high purchasing power, we only need to make sure that the customer ID and the list price are not null.
  2. I also found four unnamed columns. By inspecting the formulas behind them, we can define those columns as follows (see the sketch after this list). Unnamed 1: a random number between 40 and 110. Unnamed 2: if the customer has a car, Unnamed 1 times 1.25, otherwise times 1. Unnamed 3: if the property valuation is greater than 8 or the past-3-years bike-related purchases are greater than 80, Unnamed 2 times 1.25, otherwise times 1. Unnamed 4: if the customer is classified as a mass customer, Unnamed 3 times 0.85, otherwise times 1.
  3. Since it is not clear what these indicators are used for, I decided to leave them out of the analysis.
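
For reference, here is a sketch of the logic those formulas appear to implement; the dataframe and the exact column names (owns_car, property_valuation, past_3_years_bike_related_purchases, wealth_segment) are assumptions based on the dataset's apparent schema, not confirmed business rules:

import numpy as np
import pandas as pd

# Hypothetical customer rows, only to make the sketch runnable; the real
# columns live in the CustomerDemographic and CustomerAddress sheets
cust = pd.DataFrame({
    'owns_car': ['Yes', 'No'],
    'property_valuation': [9, 5],
    'past_3_years_bike_related_purchases': [30, 90],
    'wealth_segment': ['Mass Customer', 'Affluent Customer'],
})

# Reconstruction of the inferred formulas (assumption, not a confirmed rule)
cust['unnamed_1'] = np.random.uniform(40, 110, len(cust))
cust['unnamed_2'] = cust['unnamed_1'] * np.where(cust['owns_car'] == 'Yes', 1.25, 1.0)
cust['unnamed_3'] = cust['unnamed_2'] * np.where(
    (cust['property_valuation'] > 8)
    | (cust['past_3_years_bike_related_purchases'] > 80), 1.25, 1.0)
cust['unnamed_4'] = cust['unnamed_3'] * np.where(
    cust['wealth_segment'] == 'Mass Customer', 0.85, 1.0)
print(cust)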
# Now, let's merge the other sheets with the Transactions data on customer_id
# Note: pd.merge can only merge two dataframes at a time; the default join is inner
Tdata = data.parse('Transactions')         # read transaction data
Cdata = data.parse('CustomerDemographic')  # customer demographics
Adata = data.parse('CustomerAddress')      # customer addresses
data2 = Tdata.merge(Cdata, on='customer_id')
data3 = data2.merge(Adata, on='customer_id')

# Delete all the rows that contain missing values and check the result
data4 = data3.dropna(how='any')
data4.isnull().sum()
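
Because the default merge is an inner join, customers that are missing from any sheet are dropped silently. A quick sanity check (a minimal sketch using the dataframes just defined) shows how many rows each step discards:

# Row counts before and after each join / cleaning step
print('transactions:', len(Tdata))
print('after joining demographics:', len(data2))
print('after joining addresses:', len(data3))
print('after dropping rows with nulls:', len(data4))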

2. Evaluating customer value to determine our target customers

Here we can apply the classic customer evaluation model: the RFM (Recency, Frequency, Monetary) model.

  • I decided to use 2017-12-30 (matching the code below) as the comparison date for calculating recency.
  • Apply percentiles to determine the R, F, and M scores (see the quantile sketch inside the code below).
  • Combine the weighted R, F, and M scores into an overall RFM score.
  • Create an RFM dataframe.
# Calculate the number of days since each transaction, relative to 2017-12-30
data4['daydiff'] = dt.datetime(2017, 12, 30) - data4['transaction_date']
# Using .dt.days converts the timedelta to an int64 day count
data4['daydiff'] = data4['daydiff'].dt.days

recency = data4.groupby('customer_id')['daydiff'].agg([('recency', 'min')])
frequency = data4.groupby('customer_id')['transaction_id'].agg([('frequency', 'count')])
spend = data4.groupby('customer_id')['list_price'].agg([('total spend', 'sum')])
RFM = recency.join(frequency).join(spend)
RFM.head()
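
# Sketch (not part of the original run): the bin edges hardcoded below look
# like quartile breakpoints of each column; they can be derived like this:
print(RFM['recency'].quantile([0.25, 0.5, 0.75]))
print(RFM['frequency'].quantile([0.25, 0.5, 0.75]))
print(RFM['total spend'].quantile([0.25, 0.5, 0.75]))
# Alternatively, pd.qcut(RFM['recency'], 4, labels=range(4, 0, -1)) assigns
# quartile labels directly without hardcoding the edges.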

# Now we can assign the R, F, and M scores to each customer according to the
# percentiles and classify our customers into four categories
rlabel = range(4, 0, -1)
fmlabel = range(1, 5)
RFM['R score'] = pd.cut(RFM['recency'], [0, 18, 44, 87, 353], labels=rlabel)
RFM['F score'] = pd.cut(RFM['frequency'], [0, 4, 5, 7, 14], labels=fmlabel)
RFM['M score'] = pd.cut(RFM['total spend'], [60, 4091, 5883, 8004, 19072], labels=fmlabel)

### Assign the RFM value
# Here we can add different weights to each score
# Note: the direct multiplication raised "unsupported operand type(s) for *:
# 'Categorical' and 'float'"; converting the columns to float did not solve
# it either, so I switched to a for loop
S = []
for i, j, k in zip(RFM['R score'], RFM['F score'], RFM['M score']):
    s = i * 0.5 + j * 0.8 + k
    S.append(s)

RFM['RFM score'] = S
RFM.head()
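
# Sketch of a vectorized alternative (an assumption, not the original run):
# casting the Categorical scores to int first sidesteps the operand error
# RFM['RFM score'] = (RFM['R score'].astype(int) * 0.5
#                     + RFM['F score'].astype(int) * 0.8
#                     + RFM['M score'].astype(int))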

## According to the RFM score distribution, we can label the customers
print(RFM['RFM score'].describe())
sns.distplot(RFM['RFM score'])  # deprecated in newer seaborn; use sns.histplot(RFM['RFM score'], kde=True)

# Categorize our customers into four categories according to the RFM score percentiles

def rfm_level(df):
    if df['RFM score'] >= 8:
        return 'Can\'t Lose Them'
    elif (df['RFM score'] >= 6) and (df['RFM score'] < 8):
        return 'Loyal'
    elif (df['RFM score'] >= 4) and (df['RFM score'] < 6):
        return 'Promising'
    else:
        # the minimum possible score is 1*0.5 + 1*0.8 + 1 = 2.3,
        # so this branch covers everything below 4
        return 'Needs Attention'

RFM['RFM level'] = RFM.apply(rfm_level, axis=1)

# Now we have a well-defined dataframe that we can load into Tableau for further visualization
data4.to_excel('C:/Users/Administrator/Desktop/datacustomer.xlsx')
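
Note that data4 does not carry the RFM scores themselves; if the segmentation should also be available in Tableau, one option (a sketch, not part of the original export; the output file name is hypothetical) is to join RFM back onto the transaction data before writing:

# Attach the RFM scores to the transaction-level data before export
data5 = data4.merge(RFM.reset_index(), on='customer_id')
data5.to_excel('C:/Users/Administrator/Desktop/datacustomer_rfm.xlsx')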