EMIS532 Business Data Mining

Assignment 3

PVA Fundraising

Background

A national veterans’ organization wishes to develop a data mining model to improve the

cost-effectiveness of its direct marketing campaign. The organization, with its in-house database of

over 13 million donors, is one of the largest direct mail fundraisers in the United States.

According to their recent mailing records, the overall response rate is 5.1%. Out of those who

responded (donated), the average donation is $13.00. Each mailing, which includes a gift of

personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and

send. Using these facts, we take a sample of this dataset to develop a classification model that

can effectively capture donors so that the expected net profit is maximized. Weighted sampling

is used, under-representing the non-responders so that the sample has equal numbers of donors

and non-donors.
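
As a rough illustration of why targeting matters, the figures above can be combined into an expected net profit per mailing when the entire list is mailed. This is only a sketch; the constant and function names are illustrative and not part of the assignment.

```python
# Figures taken from the background section above.
RESPONSE_RATE = 0.051   # overall response rate
AVG_DONATION = 13.00    # average donation among responders, in dollars
MAILING_COST = 0.68     # cost to produce and send one mailing, in dollars


def expected_profit_per_mailing(response_rate):
    """Expected net profit of one mailing sent to a group with this response rate."""
    return response_rate * AVG_DONATION - MAILING_COST


# Response rate at which a mailing exactly pays for itself.
break_even_rate = MAILING_COST / AVG_DONATION

print(f"Mail everyone: {expected_profit_per_mailing(RESPONSE_RATE):+.4f} dollars per mailing")
print(f"Break-even response rate: {break_even_rate:.4f}")
```

Under these figures the break-even response rate (about 5.2%) sits slightly above the overall 5.1% rate, so mailing everyone loses a little under two cents per piece. A model is useful only if it can pick out a subgroup whose response rate exceeds the break-even rate.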

Data

The file pvaBalancedTrg.csv contains 7824 data points. The sample has been balanced to carry

equal proportions of donors and non-donors, i.e. the data has 50% donors (TARGET_B = 1) and

50% non-donors (TARGET_B = 0). The amount of donation (TARGET_D) is also included but

is not to be used in this case. The file contains 41 attributes.

The assignment

In this assignment, we examine the performance of different modeling techniques: logistic

regression, decision trees, and k-nearest neighbor classifiers. Performance should be evaluated

based on costs and benefits as given above; we want a model that maximizes profit.

PART ONE

1. Data exploration

Import the data, and examine the different variables. You are required to go through the

following aspects and explain what you did in each:

• Data statistics: distribution of values, mean, standard deviation, range of values,

frequencies, modes, etc.

• Missing values: the techniques you used to handle missing values

• Data transformation: changing data from one type to another, rescaling, data mapping,

and Principal Components Analysis (PCA); in other words, data reduction

• Data visualization: visually examining the relations between variables and the target

variable, i.e. scatterplots, histograms, bar charts, pie charts, etc.

• Variable selection: omitting variables, subsets, and list of final variables to consider for

modeling
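
As one example of the missing-value step, the metadata at the end of this document states that AGE is coded 0 when missing. The sketch below turns that placeholder into an explicit missing value and then fills it with the median of the known ages. Median imputation is only one possible choice, assumed here for illustration; the sample ages are made up.

```python
from statistics import median


def clean_age(raw_ages):
    """Map the placeholder 0 to None so it is not treated as a real age."""
    return [age if age != 0 else None for age in raw_ages]


def impute_median(values):
    """Replace None with the median of the known values (one option among several)."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


ages = [62, 0, 45, 0, 71]          # made-up values for illustration
cleaned = clean_age(ages)          # zeros become None
imputed = impute_median(cleaned)   # None becomes the median of 62, 45, 71
print(cleaned)
print(imputed)
```

Whichever technique is chosen, it helps to report how many records were affected, because a heavily imputed variable may be a candidate for omission in the variable selection step.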

2. Modeling

Partitioning – Partition the dataset into 60% training and 40% validation (set the seed to 12345).

[A specified seed ensures that we obtain the same random partitioning every time we run it. With

no specified seed, the system clock is typically used to set the seed, and a different partitioning

can result in different runs].
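
The effect of a fixed seed can be sketched in a few lines of Python. This is not the partitioning routine of any particular data mining tool, and a tool's own generator will produce a different split from the same seed; the function name `partition` and the stand-in records are assumptions made for illustration.

```python
import random


def partition(records, train_fraction=0.6, seed=12345):
    """Shuffle a copy of the records with a fixed seed and split it in two."""
    rng = random.Random(seed)      # private generator, global random state is untouched
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(7824))           # stand-in for the 7824 data points
train, valid = partition(rows)
train_again, _ = partition(rows)   # same seed, so the same partition

print(len(train), len(valid))
print(train == train_again)
```

Calling `partition` twice with the same seed returns identical training sets, which is what makes results reproducible across runs.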

Modeling – Create decision tree models to predict whether or not an individual is going to donate

to the fundraising campaign. Try different models until you arrive at one or two models that you think are

best. Describe these best models, and state how you decided they are the best.

PART TWO

3. Modeling (continued)

Consider the following classification techniques on the data:

• Naïve Bayes

• K-nearest neighbors

• Logistic Regression

• Neural Network (if covered)

• Others that may apply

Be sure to test different parameter values for each method, as you see suitable. What parameter

values do you try for the different techniques, and what do you find to work best? Provide a

comparative evaluation of performance of your best models from each technique. Feel free to use

the following template table to show performance comparisons:

Model# | Technique | Parameters (Configurations) | Training Accuracy | Testing Accuracy | Difference |
1 | DT (1) | | | | |
2 | DT (2) | | | | |
… | … | | | | |
6 | KNN (1) | | | | |
7 | KNN (2) | | | | |
… | … | | | | |
12 | NB | | | | |
13 | LR | | | | |

4. Classification under asymmetric response and cost

What is the reasoning behind using weighted sampling to produce a training set with equal

numbers of donors and non-donors? Why not use a simple random sample from the original

dataset? Given the actual response rate of 5.1%, how will the classification models behave under

simple sampling? In this case, is classification accuracy a good performance metric for our

purposes of maximizing net profit? If not, how would you determine the best model? Explain

your reasoning.

5. Calculate Net Profit

• For each method in Questions 2 and 3 (choose the ‘best’ model for each method/technique),

calculate the net profit for both the training and validation sets based on the actual

response rate (5.1%). Again, the expected donation, given that they are donors, is $13.00,

and the total cost of each mailing is $0.68. (Hint: to calculate estimated net profit, we will

need to “undo” the effects of the weighted sampling, and calculate the net profit that

would reflect the actual response distribution of 5.1% donors and 94.9% non-donors.)

• Draw each model’s net profit lift curve for the validation set onto a single graph. Are

there any models that dominate?
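
One common way to “undo” the weighted sampling is to give each record in the balanced sample a weight equal to its class’s actual proportion divided by its proportion in the sample. The sketch below applies such weights to build a cumulative net profit curve over records ranked by model score. It is one possible approach under that assumption, not the required method; all names and the six example records are hypothetical.

```python
ACTUAL_DONOR_RATE = 0.051   # actual response rate
SAMPLE_DONOR_RATE = 0.50    # donor share in the balanced sample
AVG_DONATION = 13.00
MAILING_COST = 0.68

# Relative weight of one sampled record in the actual population.
DONOR_WEIGHT = ACTUAL_DONOR_RATE / SAMPLE_DONOR_RATE                   # about 0.102
NON_DONOR_WEIGHT = (1 - ACTUAL_DONOR_RATE) / (1 - SAMPLE_DONOR_RATE)   # about 1.898


def weighted_profit(is_donor):
    """Reweighted net profit of mailing one record from the balanced sample."""
    if is_donor:
        return DONOR_WEIGHT * (AVG_DONATION - MAILING_COST)
    return NON_DONOR_WEIGHT * (-MAILING_COST)


def cumulative_profit_curve(actual, scores):
    """Net profit after mailing the top 1, 2, ..., n records ranked by score."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    curve, running = [], 0.0
    for _, is_donor in ranked:
        running += weighted_profit(is_donor)
        curve.append(running)
    return curve


actual = [1, 0, 1, 0, 0, 1]                   # made-up validation labels
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]       # made-up predicted donor probabilities
curve = cumulative_profit_curve(actual, scores)
print([round(value, 3) for value in curve])
```

Plotting one such curve per model on shared axes gives the comparison the question asks for. A model dominates if its curve lies on or above the others at every mailing depth.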

6. Best Model

• Summarize the performance of the ‘best’ model from each method, in terms of Net Profit

from predicting donors in the validation dataset; at what cutoff is the best performance

obtained? You may use the following table as a template to assess the models:

Classification Technique | Training Accuracy | Testing Accuracy | Net Profit | Cutoff Value |
Decision Trees | | | | |
K-Nearest Neighbor | | | | |
Naïve Bayes | | | | |
Logistic Regression | | | | |

• From your answers above, what do you think is the “best” model? (What criteria do you

use to determine ‘best’?)
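
A cutoff can be chosen by evaluating the reweighted net profit at several candidate values and keeping the most profitable one. The sketch below assumes the same class reweighting idea as in the net profit hint (actual proportion divided by sample proportion); the function names, candidate grid, and example data are hypothetical.

```python
def net_profit_at_cutoff(actual, scores, cutoff,
                         donor_rate=0.051, sample_donor_rate=0.5,
                         donation=13.00, cost=0.68):
    """Reweighted net profit from mailing every record with score >= cutoff."""
    w_donor = donor_rate / sample_donor_rate
    w_non = (1 - donor_rate) / (1 - sample_donor_rate)
    profit = 0.0
    for is_donor, score in zip(actual, scores):
        if score >= cutoff:
            profit += w_donor * (donation - cost) if is_donor else -w_non * cost
    return profit


def best_cutoff(actual, scores, candidates):
    """Candidate cutoff with the highest reweighted net profit."""
    return max(candidates, key=lambda c: net_profit_at_cutoff(actual, scores, c))


actual = [1, 1, 0, 1, 0, 0]                          # made-up validation labels
scores = [0.95, 0.85, 0.60, 0.55, 0.30, 0.10]        # made-up donor probabilities
candidates = [i / 10 for i in range(1, 10)]          # 0.1, 0.2, ..., 0.9
cutoff = best_cutoff(actual, scores, candidates)
print(cutoff, round(net_profit_at_cutoff(actual, scores, cutoff), 3))
```

The cutoff should be selected on the validation set, not the training set, so that the reported profit is not inflated by overfitting.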

7. Testing your final model

The file FutureFundraising.xls contains the attributes for future mailing candidates. Using your

“best” model (Q6), apply the model to the data and predict the class (donor or non-donor).

Don’t forget to use the same cutoff you determined for your best model to predict donor/non-donor.

Submit this file (in XLSX or XLS format), with your best model’s predictions (probabilities of

being a donor). Make sure that the output file contains the following attributes:

• CONTROLN

• Confidence (1)

• Confidence (0)

• Target_B (prediction)
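
The layout of the four required attributes can be sketched as follows. This writes CSV text for illustration only, so the result would still need to be saved in XLSX or XLS format for submission. The control numbers, probabilities, and the 0.5 cutoff are made up, and `write_predictions` is a hypothetical helper.

```python
import csv
import io


def write_predictions(rows, cutoff, out):
    """rows: iterable of (control_number, probability_of_donor) pairs."""
    writer = csv.writer(out)
    writer.writerow(["CONTROLN", "Confidence (1)", "Confidence (0)", "Target_B"])
    for control_number, p_donor in rows:
        predicted = 1 if p_donor >= cutoff else 0   # apply the chosen cutoff
        writer.writerow([control_number, f"{p_donor:.4f}", f"{1 - p_donor:.4f}", predicted])


buffer = io.StringIO()
write_predictions([(1001, 0.72), (1002, 0.31)], cutoff=0.5, out=buffer)
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```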

Further, calculate the total profit of your model on this unseen data. Use the original cost/benefit

figures to do the calculation. Then, write short paragraphs on the following:

• Draw conclusions about the performance of your best model in terms of the campaign’s cost/benefit. Did

your analysis turn out to be beneficial? What went right or wrong?

• Comment on the applicability of the technique you used. How would it relate to the

structure of the data?

• What are some other practical aspects that may increase the applicability of the model, or

aspects that may affect your analysis, in a positive or negative way?

Submit the following:

• Your answers to the above questions in a Word document

• One Excel document after applying the model in Question 7

METADATA

STATE State abbreviation (a nominal/symbolic field)

DOMAIN DOMAIN/Cluster code. A nominal or symbolic field that

can be broken down by bytes as explained below.

1st byte = Urbanicity level of the donor’s neighborhood

U=Urban

C=City

S=Suburban

T=Town

R=Rural

2nd byte = Socio-Economic status of the neighborhood

1 = Highest SES

2 = Average SES

3 = Lowest SES (except for Urban communities, where

1 = Highest SES, 2 = Above average SES,

3 = Below average SES, 4 = Lowest SES.)

AGE Overlay Age

0 = missing

HOMEOWNR Home Owner Flag

H = Home owner

U = Unknown

NUMCHLD NUMBER OF CHILDREN

INCOME HOUSEHOLD INCOME

GENDER Gender

M = Male

F = Female

U = Unknown

J = Joint Account, unknown gender

WEALTH1 Wealth Rating

HIT Indicates total number of known times the donor has responded to a mail

order offer other than PVA’s.

IC1 Median Household Income in hundreds

IC2 Median Family Income in hundreds

IC3 Average Household Income in hundreds

IC4 Average Family Income in hundreds

IC5 Per Capita Income

IC6 Percent Households w/ Income < $15,000

IC7 Percent Households w/ Income $15,000 – $24,999

IC8 Percent Households w/ Income $25,000 – $34,999

IC9 Percent Households w/ Income $35,000 – $49,999

IC10 Percent Households w/ Income $50,000 – $74,999

IC11 Percent Households w/ Income $75,000 – $99,999

IC12 Percent Households w/ Income $100,000 – $124,999

IC13 Percent Households w/ Income $125,000 – $149,999

IC14 Percent Households w/ Income >= $150,000

CARDPROM | Lifetime number of card promotions received to date. Card promotions are promotion type FS, GK, TK, SK, NK, XK, UF, UU. |

NUMPROM | Lifetime number of promotions received to date |

RAMNTALL | Dollar amount of lifetime gifts to date |

NGIFTALL | Number of lifetime gifts to date |

CARDGIFT | Number of lifetime gifts to card promotions to date |

MINRAMNT | Dollar amount of smallest gift to date |

MINRDATE | Date associated with the smallest gift to date |

MAXRAMNT | Dollar amount of largest gift to date |

MAXRDATE | Date associated with the largest gift to date |

LASTGIFT | Dollar amount of most recent gift |

LASTDATE | Date associated with the most recent gift |

FISTDATE | Date of first gift |

NEXTDATE | Date of second gift |

TIMELAG | Number of months between first and second gift |

AVGGIFT | Average dollar amount of gifts to date |

CONTROLN | Control number (unique record identifier) |

TARGET_B | Target Variable: Binary Indicator for Response to 97NK Mailing |

TARGET_D | Target Variable: Donation Amount (in $) associated with the Response to 97NK Mailing |
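
As an example of a data transformation based on this metadata, the two-byte DOMAIN code described above can be split into an urbanicity label and a socio-economic status level. The sketch assumes that blank or malformed codes should be treated as missing; how such codes actually appear in the file is not stated in this document, and `split_domain` is a hypothetical helper.

```python
URBANICITY = {"U": "Urban", "C": "City", "S": "Suburban", "T": "Town", "R": "Rural"}


def split_domain(code):
    """Return (urbanicity label, SES level), or (None, None) if the code is unusable."""
    code = code.strip()
    if len(code) != 2 or code[0] not in URBANICITY or not code[1].isdigit():
        return None, None
    return URBANICITY[code[0]], int(code[1])


parsed = [split_domain(c) for c in ["S1", "U4", "R3", " "]]
print(parsed)
```

Note that the SES scale differs for Urban neighborhoods (four levels instead of three), so the numeric level is not directly comparable across urbanicity groups without further recoding.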