EMIS532 Business Data Mining
Assignment 3
PVA Fundraising
Background
A national veterans’ organization wishes to develop a data mining model to improve the cost effectiveness of its direct marketing campaign. The organization, with an in-house database of over 13 million donors, is one of the largest direct mail fundraisers in the United States. According to its recent mailing records, the overall response rate is 5.1%. Among those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and an assortment of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling is used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
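As a quick check on the figures above, the following Python sketch computes the expected net profit per mailing if every candidate were mailed, and the response rate at which a mailing breaks even. The variable names are illustrative only and are not part of the assignment.

```python
# Figures taken from the background paragraph.
RESPONSE_RATE = 0.051   # overall response rate
AVG_DONATION = 13.00    # average donation among responders, in dollars
MAILING_COST = 0.68     # cost to produce and send one mailing, in dollars

# Expected net profit per piece if every candidate is mailed (about -0.017).
profit_mail_all = RESPONSE_RATE * AVG_DONATION - MAILING_COST

# Response rate at which expected donation equals mailing cost (about 0.0523).
break_even_rate = MAILING_COST / AVG_DONATION

print(f"Net profit per mailing when mailing everyone: ${profit_mail_all:.3f}")
print(f"Break-even response rate: {break_even_rate:.2%}")
```

Under these figures the overall response rate sits slightly below the break-even rate, which is why a model that concentrates likely donors among the mailed candidates matters.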
Data
The file pvaBalancedTrg.csv contains 7824 data points. The sample has been balanced to carry equal proportions of donors and non-donors, i.e. the data has 50% donors (TARGET_B = 1) and 50% non-donors (TARGET_B = 0). The amount of donation (TARGET_D) is also included but is not to be used in this case. The file contains 41 attributes.
The assignment
In this assignment, we examine the performance of different modeling techniques: logistic regression, decision trees, and k-nearest neighbor classifiers. Performance should be evaluated based on costs and benefits as given above; we want a model that maximizes profit.
PART ONE
1. Data exploration
Import the data, and examine the different variables. You are required to go through the following aspects and explain what you did in each:
Data statistics: distribution of values, mean, standard deviation, range of values, frequencies, modes, etc.
Missing values: the techniques you used to handle missing values
Data transformation and reduction: changing data from one type to another, rescaling, data mapping, and Principal Components Analysis (PCA)
Data visualization: visually examining the relations between variables and the target variable, i.e. scatterplots, histograms, bar charts, pie charts, etc.
Variable selection: omitting variables, subsets, and list of final variables to consider for modeling
2. Modeling
Partitioning – Partition the dataset into 60% training and 40% validation (set the seed to 12345). [A specified seed ensures that we obtain the same random partitioning every time we run it. With no specified seed, the system clock is typically used to set the seed, so different runs can produce different partitions.]
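A minimal sketch of a seeded 60/40 partition, using only the Python standard library. The `partition` helper and the stand-in `records` list are hypothetical. A given seed reproduces the same split only within the same tool, so seed 12345 here will not match the partition another package produces with the same seed.

```python
import random

def partition(rows, train_fraction=0.6, seed=12345):
    """Shuffle a copy of rows with a fixed seed, then split into (training, validation)."""
    rng = random.Random(seed)      # private generator, so the global state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 7824 rows of the data file (row indices only).
records = list(range(7824))

train, valid = partition(records)
train_again, valid_again = partition(records)

print(len(train), len(valid))      # sizes of the two partitions
print(train == train_again)        # same seed, same split
```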
Modeling – Create decision tree models to predict whether or not an individual will donate to the fundraising campaign. Try different models until you arrive at one or two that you think are best. Describe these best models, and state how you decided they are the best.
PART TWO
3. Modeling (continued)
Consider the following classification techniques on the data:
Naïve Bayes
K-nearest neighbors
Logistic Regression
Neural Network (if covered)
Others that may apply
Be sure to test different parameter values for each method, as you see fit. What parameter values did you try for the different techniques, and what did you find to work best? Provide a comparative evaluation of the performance of your best models from each technique. Feel free to use the following template table to show performance comparisons:
Model# | Technique | Parameters (Configurations) | Training Accuracy | Testing Accuracy | Difference |
1 | DT (1) | | | | |
2 | DT (2) | | | | |
… | … | | | | |
6 | KNN (1) | | | | |
7 | KNN (2) | | | | |
… | … | | | | |
12 | NB | | | | |
13 | LR | | | | |
4. Classification under asymmetric response and cost
What is the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors? Why not use a simple random sample from the original dataset? Given the actual response rate of 5.1%, how will the classification models behave under simple sampling? In this case, is classification accuracy a good performance metric for our purposes of maximizing net profit? If not, how would you determine the best model? Explain your reasoning.
5. Calculate Net Profit
For each method in Questions 2 and 3 (choose the ‘best’ model for each method/technique), calculate the net profit for both the training and validation sets based on the actual response rate (5.1%). Again, the expected donation, given that they are donors, is $13.00, and the total cost of each mailing is $0.68. (Hint: to calculate estimated net profit, we will need to “undo” the effects of the weighted sampling, and calculate the net profit that would reflect the actual response distribution of 5.1% donors and 94.9% non-donors.)
Draw each model’s net profit lift curve for the validation set onto a single graph. Are there any models that dominate?
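One way to approach the hint above is sketched below in Python. It assumes one common reweighting convention: each sampled donor is weighted by 0.051/0.5 and each sampled non-donor by 0.949/0.5, so that the weighted sample mirrors the actual response distribution. The function names and the toy scores are hypothetical, and the sketch assumes the partition being scored is still close to 50% donors.

```python
SAMPLE_DONOR_SHARE = 0.5     # share of donors in the balanced sample
ACTUAL_DONOR_SHARE = 0.051   # actual response rate
AVG_DONATION = 13.00
MAILING_COST = 0.68

# Assumed weights that "undo" the balanced sampling.
DONOR_WEIGHT = ACTUAL_DONOR_SHARE / SAMPLE_DONOR_SHARE                   # 0.102
NON_DONOR_WEIGHT = (1 - ACTUAL_DONOR_SHARE) / (1 - SAMPLE_DONOR_SHARE)   # 1.898

def profit_curve(scored):
    """scored: list of (probability_of_donor, actual_label) pairs.
    Returns [(probability, cumulative weighted net profit)] when records are
    mailed in order of decreasing predicted probability."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    curve, total = [], 0.0
    for prob, label in ranked:
        if label == 1:
            total += DONOR_WEIGHT * (AVG_DONATION - MAILING_COST)
        else:
            total -= NON_DONOR_WEIGHT * MAILING_COST
        curve.append((prob, total))
    return curve

def best_cutoff(scored):
    """Return the (probability, profit) point where cumulative net profit peaks."""
    return max(profit_curve(scored), key=lambda point: point[1])

# Toy validation scores (made up for illustration only).
toy_scores = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0), (0.2, 0)]
cutoff, profit = best_cutoff(toy_scores)
print(f"Best cutoff: {cutoff}, weighted net profit: {profit:.5f}")
```

Plotting the second element of each curve point against the number of records mailed, one line per model, gives the net profit lift curves on a single graph. The probability at the peak of a curve is a candidate cutoff for that model.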
6. Best Model
Summarize the performance of the ‘best’ model from each method, in terms of Net Profit from predicting donors in the validation dataset. At what cutoff is the best performance obtained? You may use the following table as a template to assess the models:
Classification Technique | Training Accuracy | Testing Accuracy | Net Profit | Cutoff Value |
Decision Trees | | | | |
K-Nearest Neighbor | | | | |
Naïve Bayes | | | | |
Logistic Regression | | | | |
From your answers above, what do you think is the “best” model? (What criteria do you use to determine ‘best’?)
7. Testing your final model
The file FutureFundraising.xls contains the attributes for future mailing candidates. Apply your “best” model (Q6) to these data and predict the class (donor or non-donor). Don’t forget to use the same cutoff you determined for your best model when predicting donor/non-donor.
Submit this file (in XLSX or XLS format), with your best model’s predictions (probabilities of being a donor). Make sure that the output file contains the following attributes:
CONTROLN
Confidence (1)
Confidence (0)
Target_B (prediction)
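A minimal sketch of assembling the four required output attributes from a model’s donor probabilities and a chosen cutoff. The `build_rows` helper, the CONTROLN values, the probabilities, and the cutoff of 0.5 are all hypothetical. The sketch writes CSV text because the standard library has no XLSX writer, so the result would still need to be saved in XLSX or XLS format with a spreadsheet tool.

```python
import csv
import io

FIELDS = ["CONTROLN", "Confidence (1)", "Confidence (0)", "Target_B (prediction)"]

def build_rows(scores, cutoff):
    """scores: list of (CONTROLN, probability_of_donor) pairs.
    Classifies a record as donor (1) when its probability meets the cutoff."""
    rows = []
    for controln, p_donor in scores:
        rows.append({
            "CONTROLN": controln,
            "Confidence (1)": p_donor,
            "Confidence (0)": 1 - p_donor,
            "Target_B (prediction)": 1 if p_donor >= cutoff else 0,
        })
    return rows

# Made-up identifiers and probabilities, for illustration only.
scores = [(101, 0.72), (102, 0.31)]
rows = build_rows(scores, cutoff=0.5)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```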
Further, calculate the total profit of your model on this unseen data. Use the original cost/benefit figures to do the calculation. Then, write short paragraphs on the following:
Draw conclusions about the performance of your best model in terms of the campaign’s cost/benefit. Did your analysis turn out to be beneficial? What went right or wrong?
Comment on the applicability of the technique you used. How would it relate to the structure of the data?
What are some other practical aspects that may increase the applicability of the model, or aspects that may affect your analysis, in a positive or negative way?
Submit the following:
Your answers to the above questions in a Word document
One Excel document after applying the model in Question 7
METADATA
STATE State abbreviation (a nominal/symbolic field)
DOMAIN DOMAIN/Cluster code (a nominal/symbolic field). It can be broken down by bytes as explained below.
1st byte = Urbanicity level of the donor’s neighborhood
U=Urban
C=City
S=Suburban
T=Town
R=Rural
2nd byte = Socio-Economic status of the neighborhood
1 = Highest SES
2 = Average SES
3 = Lowest SES
(Exception: for Urban communities, 1 = Highest SES, 2 = Above average SES, 3 = Below average SES, 4 = Lowest SES.)
AGE Overlay Age
0 = missing
HOMEOWNR Home Owner Flag
H = Home owner
U = Unknown
NUMCHLD NUMBER OF CHILDREN
INCOME HOUSEHOLD INCOME
GENDER Gender
M = Male
F = Female
U = Unknown
J = Joint Account, unknown gender
WEALTH1 Wealth Rating
HIT Indicates total number of known times the donor has responded to a mail order offer other than PVA’s.
IC1 Median Household Income in hundreds
IC2 Median Family Income in hundreds
IC3 Average Household Income in hundreds
IC4 Average Family Income in hundreds
IC5 Per Capita Income
IC6 Percent Households w/ Income < $15,000
IC7 Percent Households w/ Income $15,000 – $24,999
IC8 Percent Households w/ Income $25,000 – $34,999
IC9 Percent Households w/ Income $35,000 – $49,999
IC10 Percent Households w/ Income $50,000 – $74,999
IC11 Percent Households w/ Income $75,000 – $99,999
IC12 Percent Households w/ Income $100,000 – $124,999
IC13 Percent Households w/ Income $125,000 – $149,999
IC14 Percent Households w/ Income >= $150,000
CARDPROM Lifetime number of card promotions received to date. Card promotions are promotion type FS, GK, TK, SK, NK, XK, UF, UU.
NUMPROM Lifetime number of promotions received to date
RAMNTALL Dollar amount of lifetime gifts to date
NGIFTALL Number of lifetime gifts to date
CARDGIFT Number of lifetime gifts to card promotions to date
MINRAMNT Dollar amount of smallest gift to date
MINRDATE Date associated with the smallest gift to date
MAXRAMNT Dollar amount of largest gift to date
MAXRDATE Date associated with the largest gift to date
LASTGIFT Dollar amount of most recent gift
LASTDATE Date associated with the most recent gift
FISTDATE Date of first gift
NEXTDATE Date of second gift
TIMELAG Number of months between first and second gift
AVGGIFT Average dollar amount of gifts to date
CONTROLN Control number (unique record identifier)
TARGET_B Target Variable: Binary Indicator for Response to 97NK Mailing
TARGET_D Target Variable: Donation Amount (in $) associated with the Response to 97NK Mailing
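Two of the field descriptions above lend themselves to simple preprocessing: DOMAIN can be split into its two bytes, and AGE = 0 can be treated as missing. The Python sketch below illustrates both. The helper names are hypothetical, and returning None for unparseable or missing values is just one possible choice.

```python
URBANICITY = {"U": "Urban", "C": "City", "S": "Suburban", "T": "Town", "R": "Rural"}

def split_domain(code):
    """Split a DOMAIN code into (urbanicity, SES level).
    Returns (None, None) when the code is blank or not in the documented form."""
    code = (code or "").strip()
    if len(code) != 2 or code[0] not in URBANICITY or not code[1].isdigit():
        return None, None
    return URBANICITY[code[0]], int(code[1])

def clean_age(value):
    """Return AGE as an int, or None when it is blank or 0 (0 = missing)."""
    text = str(value).strip()
    if text == "":
        return None
    age = int(float(text))
    return None if age == 0 else age

print(split_domain("U1"))   # urban neighborhood, SES level 1
print(split_domain(" "))    # blank code
print(clean_age("0"))       # missing age
print(clean_age("62"))
```

Note that the SES level is read differently for Urban codes (four levels) than for the others (three levels), so the integer alone should not be compared across urbanicity groups without accounting for that.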