EMIS532 Business Data Mining

Assignment 3

PVA Fundraising

Background

A national veterans’ organization wishes to develop a data mining model to improve the cost effectiveness of its direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct mail fundraisers in the United States. According to its recent mailing records, the overall response rate is 5.1%. Among those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and an assortment of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling is used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
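To see why targeting matters, the figures above can be combined into a quick baseline calculation. The following is a minimal sketch using only the numbers quoted in this section; the variable names are our own.

```python
# Figures quoted in the Background section.
RESPONSE_RATE = 0.051   # overall response rate
AVG_DONATION = 13.00    # average donation among responders, in dollars
MAILING_COST = 0.68     # cost to produce and send one mailing, in dollars

# Expected net profit per piece if every candidate is mailed.
profit_per_mailing = RESPONSE_RATE * AVG_DONATION - MAILING_COST

# Response probability at which a mailing exactly pays for itself.
break_even_rate = MAILING_COST / AVG_DONATION

print(f"Expected profit per mailing: ${profit_per_mailing:.3f}")
print(f"Break-even response rate: {break_even_rate:.2%}")
```

Mailing everyone loses about 1.7 cents per piece on average, because the 5.1% response rate sits just below the break-even rate of roughly 5.23%. A model only has to concentrate donors slightly among the people it selects to make the campaign profitable.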

Data

The file pvaBalancedTrg.csv contains 7824 data points. The sample has been balanced to carry equal proportions of donors and non-donors, i.e. the data has 50% donors (TARGET_B = 1) and 50% non-donors (TARGET_B = 0). The amount of donation (TARGET_D) is also included but is not to be used in this case. The file contains 41 attributes.

The assignment

In this assignment, we examine the performance of different modeling techniques: logistic regression, decision trees, and k-nearest neighbor classifiers. Performance should be evaluated based on costs and benefits as given above; we want a model that maximizes profit.

 

PART ONE

1. Data exploration

Import the data, and examine the different variables. You are required to go through the following aspects and explain what you did in each:

Data statistics: distribution of values, mean, standard deviation, range of values, frequencies, modes, etc.

Missing values: the techniques you used to handle missing values

Data transformation: changing data from one type to another, rescaling, data mapping, and Principal Components Analysis (PCA); in other words, data reduction

Data visualization: visually examining the relations between variables and the target variable, i.e. scatterplots, histograms, bar charts, pie charts, etc.

Variable selection: omitting variables, subsets, and list of final variables to consider for modeling
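As an illustration of the data statistics and missing values items above, here is a minimal sketch that summarizes one numeric column and imputes its missing entries. It uses only the Python standard library; the helper names (summarize, impute_median) and the toy values are our own, and median imputation is just one of several reasonable choices, not a prescribed method.

```python
import statistics

def summarize(values, missing_code=None):
    """Return basic statistics for a numeric column, ignoring missing entries."""
    present = [v for v in values if v is not None and v != missing_code]
    return {
        "n": len(values),
        "n_missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
        "mode": statistics.mode(present),
    }

def impute_median(values, missing_code=None):
    """Replace missing entries with the median of the observed values."""
    present = [v for v in values if v is not None and v != missing_code]
    median = statistics.median(present)
    return [median if (v is None or v == missing_code) else v for v in values]

# Made-up AGE values for illustration only; 0 is the missing code per the metadata.
toy_age = [62, 0, 45, 71, 0, 58, 45]
stats = summarize(toy_age, missing_code=0)
filled = impute_median(toy_age, missing_code=0)
print(stats)
print(filled)
```

Whatever technique you choose, report how many values were affected, since a heavily imputed variable may be a candidate for omission during variable selection.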

2. Modeling

Partitioning – Partition the dataset into 60% training and 40% validation (set the seed to 12345). [A specified seed ensures that we obtain the same random partitioning every time we run it. With no specified seed, the system clock is typically used to set the seed, so different runs can produce different partitions.]
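The effect of a fixed seed can be sketched in a few lines. This is an illustration in plain Python, assuming the rows are identified by position; the partition function is our own helper. Note that the same seed value in a different tool produces a different (but equally reproducible) split.

```python
import random

def partition(records, train_fraction=0.6, seed=12345):
    """Shuffle a copy of the records with a fixed seed and split it in two."""
    rng = random.Random(seed)      # private generator; global random state untouched
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

row_ids = list(range(7824))        # stand-in for the 7824 rows of the file
train, validation = partition(row_ids)
print(len(train), len(validation))
```

Calling partition(row_ids) again returns exactly the same two lists, which is what makes results comparable across runs and across models.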

Modeling – Create decision tree models to predict whether or not an individual will donate to the fundraising campaign. Try different models until you arrive at one or two that you think are best. Describe these best models, and state how you decided they are the best.

 

PART TWO

3. Modeling (continued)

Consider the following classification techniques on the data:

Naïve Bayes

K-nearest neighbors

Logistic Regression

Neural Network (if covered)

Others that may apply

Be sure to test different parameter values for each method, as you see suitable. What parameter values do you try for the different techniques, and what do you find to work best? Provide a comparative evaluation of performance of your best models from each technique. Feel free to use the following template table to show performance comparisons:

Model# | Technique | Parameters (Configurations) | Training Accuracy | Testing Accuracy | Difference
1      | DT (1)    |                             |                   |                  |
2      | DT (2)    |                             |                   |                  |
6      | KNN (1)   |                             |                   |                  |
7      | KNN (2)   |                             |                   |                  |
12     | NB        |                             |                   |                  |
13     | LR        |                             |                   |                  |

 

4. Classification under asymmetric response and cost

What is the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors? Why not use a simple random sample from the original dataset? Given the actual response rate of 5.1%, how will the classification models behave under simple sampling? In this case, is classification accuracy a good performance metric for our purposes of maximizing net profit? If not, how would you determine the best model? Explain your reasoning.
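One way to start thinking about the accuracy question is to compare two mailing rules at the actual 5.1% response rate. The sketch below is illustrative only: the evaluate helper is our own, and the 60%/40% capture rates for the second rule are hypothetical numbers, not results from any model.

```python
RESPONSE_RATE = 0.051
AVG_DONATION = 13.00
MAILING_COST = 0.68

def evaluate(n_candidates, mail_donors_frac, mail_nondonors_frac):
    """Accuracy and net profit for a rule that mails the given fractions
    of true donors and of true non-donors."""
    donors = n_candidates * RESPONSE_RATE
    nondonors = n_candidates - donors
    tp = donors * mail_donors_frac          # donors who are mailed
    fp = nondonors * mail_nondonors_frac    # non-donors who are mailed
    tn = nondonors - fp                     # non-donors correctly skipped
    accuracy = (tp + tn) / n_candidates
    profit = tp * AVG_DONATION - (tp + fp) * MAILING_COST
    return accuracy, profit

# Rule A: never mail anyone (what a model can collapse to at a 5.1% base rate).
acc_none, profit_none = evaluate(100_000, 0.0, 0.0)

# Rule B: hypothetical model reaching 60% of donors and 40% of non-donors.
acc_model, profit_model = evaluate(100_000, 0.6, 0.4)

print(f"Rule A: accuracy {acc_none:.1%}, profit ${profit_none:,.2f}")
print(f"Rule B: accuracy {acc_model:.1%}, profit ${profit_model:,.2f}")
```

Rule A is far more accurate yet earns nothing, while the less accurate Rule B is profitable. Your written answer should explain why this happens and what it implies for choosing a performance metric.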

 

5. Calculate Net Profit

For each method in Questions 2 and 3 (choose the ‘best’ model for each method/technique), calculate the net profit for both the training and validation sets based on the actual response rate (5.1%). Again, the expected donation, given that the recipient is a donor, is $13.00, and the total cost of each mailing is $0.68. (Hint: to calculate estimated net profit, we will need to “undo” the effects of the weighted sampling, and calculate the net profit that would reflect the actual response distribution of 5.1% donors and 94.9% non-donors.)
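One possible reading of the hint is to reweight each record in the balanced sample so that donors and non-donors count in their real proportions. The sketch below follows that reading; the function name and the confusion-matrix counts in the example are hypothetical, and other ways of undoing the sampling are acceptable if you justify them.

```python
TRUE_DONOR_RATE = 0.051      # actual response rate
SAMPLE_DONOR_RATE = 0.5      # donor share in the balanced sample
AVG_DONATION = 13.00
MAILING_COST = 0.68

# Weight applied to each sampled record so the class mix matches reality.
DONOR_WEIGHT = TRUE_DONOR_RATE / SAMPLE_DONOR_RATE                  # 0.102
NONDONOR_WEIGHT = (1 - TRUE_DONOR_RATE) / (1 - SAMPLE_DONOR_RATE)   # 1.898

def adjusted_net_profit(true_positives, false_positives):
    """Net profit of mailing all predicted donors, reweighted to a 5.1% base rate.

    true_positives: actual donors the model flags as donors (sample count).
    false_positives: actual non-donors the model flags as donors (sample count).
    """
    gain = true_positives * DONOR_WEIGHT * (AVG_DONATION - MAILING_COST)
    loss = false_positives * NONDONOR_WEIGHT * MAILING_COST
    return gain - loss

# Hypothetical confusion-matrix counts, for illustration only.
print(round(adjusted_net_profit(true_positives=1000, false_positives=800), 3))
```

Each correctly targeted donor contributes the donation minus the mailing cost, while each wrongly targeted non-donor costs one mailing. Because the weights average to 1 over a balanced sample, the result is on the scale of the sample size.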

Draw each model’s net profit lift curve for the validation set onto a single graph. Are there any models that dominate?
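The points of a net profit lift curve can be computed by ranking candidates from the highest to the lowest predicted donor probability and accumulating the reweighted profit. The following is a minimal sketch under that assumption; the function names and toy scores are our own, tied scores are not given special treatment, and both classes must be present in the labels. Plotting is left to whatever charting tool you use.

```python
def net_profit_curve(scores, labels, donor_rate=0.051,
                     donation=13.00, cost=0.68):
    """Cumulative reweighted net profit when mailing in descending score order.

    scores: predicted donor probabilities; labels: 1 = donor, 0 = non-donor.
    Returns a list of (score_cutoff, cumulative_profit) points.
    """
    sample_rate = sum(labels) / len(labels)
    w_donor = donor_rate / sample_rate
    w_non = (1 - donor_rate) / (1 - sample_rate)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    curve, running = [], 0.0
    for score, label in ranked:
        if label == 1:
            running += w_donor * (donation - cost)   # donor mailed: net gain
        else:
            running -= w_non * cost                  # non-donor mailed: wasted cost
        curve.append((score, running))
    return curve

def best_cutoff(curve):
    """Point on the curve with the highest cumulative profit."""
    return max(curve, key=lambda point: point[1])

# Made-up scores and labels for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
curve = net_profit_curve(toy_scores, toy_labels)
peak = best_cutoff(curve)
print(peak)
```

Computing one such curve per model on the validation set and drawing them on shared axes makes dominance easy to spot: a dominating model's curve lies on or above the others at every depth of mailing.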

 

6. Best Model

Summarize the performance of the ‘best’ model from each method in terms of net profit from predicting donors in the validation dataset. At what cutoff is the best performance obtained? You may use the following table as a template to assess the models:

Classification Technique | Training Accuracy | Testing Accuracy | Net Profit | Cutoff Value
Decision Trees           |                   |                  |            |
K-Nearest Neighbor       |                   |                  |            |
Naïve Bayes              |                   |                  |            |
Logistic Regression      |                   |                  |            |
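As a sanity check on the Cutoff Value column, it can help to know where the break-even cutoff would fall if a model's probabilities were well calibrated on the balanced sample. The sketch below applies a standard prior-shift correction; it is a reference point under that calibration assumption, not a substitute for finding the cutoff empirically, and the function names are our own.

```python
def correct_for_sampling(p_balanced, donor_rate=0.051, sample_rate=0.5):
    """Map a probability estimated on the balanced sample to the true base rate."""
    num = p_balanced * donor_rate / sample_rate
    den = num + (1 - p_balanced) * (1 - donor_rate) / (1 - sample_rate)
    return num / den

def break_even_cutoff_balanced(donor_rate=0.051, sample_rate=0.5,
                               donation=13.00, cost=0.68):
    """Balanced-sample probability above which a mailing has positive expected profit."""
    p_true = cost / donation                  # break-even probability on the true scale
    odds_true = p_true / (1 - p_true)
    # Undo the prior shift: scale true odds by the non-donor/donor weight ratio.
    ratio = ((1 - donor_rate) / (1 - sample_rate)) / (donor_rate / sample_rate)
    odds_balanced = odds_true * ratio
    return odds_balanced / (1 + odds_balanced)

cutoff = break_even_cutoff_balanced()
print(f"Break-even cutoff on the balanced scale: {cutoff:.4f}")
print(f"Same cutoff on the true scale: {correct_for_sampling(cutoff):.4f}")
```

Under these assumptions the break-even cutoff on the balanced scale lands slightly above 0.5. Real models are rarely perfectly calibrated, so the empirically best cutoff may differ noticeably from this value.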

 

From your answers above, what do you think is the “best” model? (What criteria do you use to determine ‘best’?)

 

 

7. Testing your final model

The file FutureFundraising.xls contains the attributes for future mailing candidates. Using your “best” model (Question 6), apply the model to the data and predict the class (donor or non-donor). Don’t forget to use the same cutoff you determined for your best model to predict donor/non-donor.

Submit this file (in XLSX or XLS format), with your best model’s predictions (probabilities of being a donor). Make sure that the output file contains the following attributes:

CONTROLN

Confidence (1)

Confidence (0)

Target_B (prediction)
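The layout of the output can be sketched as follows. This is only an illustration of the four required columns: it writes CSV text to an in-memory buffer using the standard library, so producing the actual XLSX or XLS file is a separate step in your tool of choice. The build_rows helper, the identifiers, the probabilities, and the 0.55 cutoff are all made up for the example.

```python
import csv
import io

FIELDS = ["CONTROLN", "Confidence (1)", "Confidence (0)", "Target_B (prediction)"]

def build_rows(control_numbers, donor_probs, cutoff):
    """One output row per candidate; class 1 when the donor probability reaches the cutoff."""
    rows = []
    for controln, p in zip(control_numbers, donor_probs):
        rows.append({
            "CONTROLN": controln,
            "Confidence (1)": round(p, 4),
            "Confidence (0)": round(1 - p, 4),
            "Target_B (prediction)": 1 if p >= cutoff else 0,
        })
    return rows

# Made-up identifiers, probabilities, and cutoff, for illustration only.
rows = build_rows([101, 102, 103], [0.62, 0.48, 0.55], cutoff=0.55)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The key point is that the predicted class comes from comparing Confidence (1) with your chosen cutoff, not with a default of 0.5.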

Further, calculate the total profit of your model on this unseen data. Use the original cost/benefit figures to do the calculation. Then, write short paragraphs on the following:

Draw conclusions about the performance of your best model in terms of the campaign’s cost/benefit. Did your analysis turn out to be beneficial? What went right or wrong?

Comment on the applicability of the technique you used. How would it relate to the structure of the data?

What are some other practical aspects that may increase the applicability of the model, or aspects that may affect your analysis, in a positive or negative way?

Submit the following:

Your answers to the above questions in a Word document

One Excel document after applying the model in Question 7

 

METADATA

STATE State abbreviation (a nominal/symbolic field)

DOMAIN Domain/cluster code (a nominal/symbolic field). It can be broken down by bytes as explained below.

1st byte = Urbanicity level of the donor’s neighborhood

U=Urban

C=City

S=Suburban

T=Town

R=Rural

2nd byte = Socio-Economic status of the neighborhood

1 = Highest SES

2 = Average SES

3 = Lowest SES

(Exception: for Urban communities, 1 = Highest SES, 2 = Above average SES, 3 = Below average SES, 4 = Lowest SES.)

AGE Overlay Age

0 = missing

HOMEOWNR Home Owner Flag

H = Home owner

U = Unknown

NUMCHLD Number of children

INCOME Household income

GENDER Gender

M = Male

F = Female

U = Unknown

J = Joint Account, unknown gender

WEALTH1 Wealth Rating

HIT Indicates total number of known times the donor has responded to a mail order offer other than PVA’s.

IC1 Median Household Income in hundreds

IC2 Median Family Income in hundreds

IC3 Average Household Income in hundreds

IC4 Average Family Income in hundreds

IC5 Per Capita Income

IC6 Percent Households w/ Income < $15,000

IC7 Percent Households w/ Income $15,000 – $24,999

IC8 Percent Households w/ Income $25,000 – $34,999

IC9 Percent Households w/ Income $35,000 – $49,999

IC10 Percent Households w/ Income $50,000 – $74,999

IC11 Percent Households w/ Income $75,000 – $99,999

IC12 Percent Households w/ Income $100,000 – $124,999

IC13 Percent Households w/ Income $125,000 – $149,999

IC14 Percent Households w/ Income >= $150,000

CARDPROM Lifetime number of card promotions received to date. Card promotions are promotion type FS, GK, TK, SK, NK, XK, UF, UU.

NUMPROM Lifetime number of promotions received to date

RAMNTALL Dollar amount of lifetime gifts to date

NGIFTALL Number of lifetime gifts to date

CARDGIFT Number of lifetime gifts to card promotions to date

MINRAMNT Dollar amount of smallest gift to date

MINRDATE Date associated with the smallest gift to date

MAXRAMNT Dollar amount of largest gift to date

MAXRDATE Date associated with the largest gift to date

LASTGIFT Dollar amount of most recent gift

LASTDATE Date associated with the most recent gift

FISTDATE Date of first gift

NEXTDATE Date of second gift

TIMELAG Number of months between first and second gift

AVGGIFT Average dollar amount of gifts to date

CONTROLN Control number (unique record identifier)

TARGET_B Target Variable: Binary Indicator for Response to 97NK Mailing

TARGET_D Target Variable: Donation Amount (in $) associated with the Response to 97NK Mailing
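The byte-wise breakdown of the DOMAIN field described above lends itself to a small data transformation. Here is a minimal sketch that splits a code into its two parts; the split_domain helper is our own, and it assumes codes are two characters long (for example "U4"), returning None for any part that is blank or unrecognized.

```python
URBANICITY = {"U": "Urban", "C": "City", "S": "Suburban", "T": "Town", "R": "Rural"}

def split_domain(code):
    """Split a DOMAIN code into (urbanicity, SES level); None where missing or unrecognized."""
    code = (code or "").strip()
    urbanicity = URBANICITY.get(code[:1])                           # 1st byte
    ses = int(code[1]) if len(code) > 1 and code[1] in "1234" else None  # 2nd byte
    return urbanicity, ses

print(split_domain("U4"))
print(split_domain("R2"))
print(split_domain(" "))
```

Keep in mind that the SES digit means something different for Urban codes (a four-level scale) than for the others (a three-level scale), so the two parts may need to be interpreted together rather than as independent variables.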