EMIS532 Business Data Mining
Assignment 3
PVA Fundraising
Background
A national veterans’ organization wishes to develop a data mining model to improve the
cost-effectiveness of its direct marketing campaign. The organization, with its in-house database of
over 13 million donors, is one of the largest direct mail fundraisers in the United States.
According to their recent mailing records, the overall response rate is 5.1%. Out of those who
responded (donated), the average donation is $13.00. Each mailing, which includes a gift of
personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and
send. Using these facts, we take a sample of this dataset to develop a classification model that
can effectively capture donors so that the expected net profit is maximized. Weighted sampling
is used, under-representing the non-responders so that the sample has equal numbers of donors
and non-donors.
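As a quick check on the figures above, the expected net profit of mailing the whole list can be worked out directly. The sketch below is illustrative only: it assumes every responder gives exactly the $13.00 average, and the constant and function names are made up for this example.

```python
# Figures quoted in the background section.
RESPONSE_RATE = 0.051   # overall response rate
AVG_DONATION = 13.00    # average donation among responders, in dollars
MAILING_COST = 0.68     # cost to produce and send one mailing, in dollars


def expected_profit_per_mailing(p_donate: float) -> float:
    """Expected net profit of one mailing sent to someone who donates with probability p_donate."""
    return p_donate * AVG_DONATION - MAILING_COST


# Mailing everyone: 0.051 * 13.00 - 0.68 = -0.017 dollars per mailing.
baseline = expected_profit_per_mailing(RESPONSE_RATE)

# Response probability at which a single mailing breaks even: 0.68 / 13.00.
break_even = MAILING_COST / AVG_DONATION

print(f"baseline per mailing: {baseline:.3f}")
print(f"break-even probability: {break_even:.4f}")
```

Under these assumptions a blanket mailing loses about 1.7 cents per piece, so a useful model only needs to single out people whose response probability is above roughly 5.2%.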
Data
The file pvaBalancedTrg.csv contains 7824 data points. The sample has been balanced to carry
equal proportions of donors and non-donors, i.e. the data has 50% donors (TARGET_B = 1) and
50% non-donors (TARGET_B = 0). The amount of donation (TARGET_D) is also included but
is not to be used in this case. The file contains 41 attributes.
The assignment
In this assignment, we examine the performance of different modeling techniques: logistic
regression, decision trees, and k-nearest neighbor classifiers. Performance should be evaluated
based on costs and benefits as given above; we want a model that maximizes profit.
PART ONE
1. Data exploration
Import the data, and examine the different variables. You are required to go through the
following aspects and explain what you did in each:
Data statistics: distribution of values, mean, standard deviation, range of values,
frequencies, modes, etc.
Missing values: the techniques you used to handle missing values
Data transformation: changing data from one type to another, rescaling, data mapping,
and Principal Components Analysis (PCA); in other words, data reduction

Data visualization: visually examining the relations between variables and the target
variable, i.e. scatterplots, histograms, bar charts, pie charts, etc.
Variable selection: omitting variables, subsets, and list of final variables to consider for
modeling
2. Modeling
Partitioning – Partition the dataset into 60% training and 40% validation (set the seed to 12345).
[A specified seed ensures that we obtain the same random partitioning every time we run it. With
no specified seed, the system clock is typically used to set the seed, so different runs can
produce different partitions.]
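The effect of a fixed seed can be illustrated with a small sketch. It uses only Python's standard `random` module; the function name `partition` is hypothetical, and the rows selected will not match those chosen by another tool given the same seed, since each tool has its own random number generator.

```python
import random


def partition(n_rows: int, train_fraction: float = 0.6, seed: int = 12345):
    """Return (train_idx, valid_idx), a reproducible random split of the row indices."""
    rng = random.Random(seed)      # private generator, so global random state is untouched
    indices = list(range(n_rows))
    rng.shuffle(indices)           # same seed -> same shuffle order every run
    n_train = round(n_rows * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# 7824 rows, as stated for pvaBalancedTrg.csv.
train_idx, valid_idx = partition(7824)
print(len(train_idx), len(valid_idx))
```

Calling `partition(7824)` again returns exactly the same two index lists, whereas a different seed produces a different split.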
Modeling – Create decision tree models to predict whether or not an individual is going to donate
to the fundraising campaign. Try different models until you arrive at one or two that you think are
best. Describe these best models, and state how you decided they are the best.
PART TWO
3. Modeling (continued)
Consider the following classification techniques on the data:
Naïve Bayes
K-nearest neighbors
Logistic Regression
Neural Network (if covered)
Others that may apply
Be sure to test different parameter values for each method, as you see suitable. What parameter
values do you try for the different techniques, and what do you find to work best? Provide a
comparative evaluation of performance of your best models from each technique. Feel free to use
the following template table to show performance comparisons:

Model#  Technique  Parameters (Configurations)  Training Accuracy  Testing Accuracy  Difference
1       DT (1)
2       DT (2)
6       KNN (1)
7       KNN (2)
12      NB
13      LR
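One way to fill in rows of such a table is to loop over candidate parameter values and record the training accuracy, the testing accuracy, and their difference. The sketch below does this for k-nearest neighbors with a hand-written classifier; the data points are made up purely for illustration (they are not PVA records), and the helper names are hypothetical.

```python
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training points (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, X, y, k):
    """Fraction of rows in (X, y) that the k-NN rule classifies correctly."""
    hits = sum(knn_predict(train_X, train_y, x, k) == label for x, label in zip(X, y))
    return hits / len(y)


# Tiny made-up data set, for illustration only.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(1.1, 1.1), (2.9, 3.1), (2.0, 2.1), (0.8, 0.9)]
test_y = [0, 1, 1, 0]

rows = []
for k in (1, 3, 5):  # odd k avoids tied votes with two classes
    train_acc = accuracy(train_X, train_y, train_X, train_y, k)
    test_acc = accuracy(train_X, train_y, test_X, test_y, k)
    rows.append((k, train_acc, test_acc, train_acc - test_acc))

for k, train_acc, test_acc, diff in rows:
    print(f"KNN k={k}: training={train_acc:.2f} testing={test_acc:.2f} difference={diff:+.2f}")
```

A large positive difference suggests overfitting. With k = 1 every training point is its own nearest neighbor, so the training accuracy is trivially perfect and says little about how the model generalizes.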

4. Classification under asymmetric response and cost
What is the reasoning behind using weighted sampling to produce a training set with equal
numbers of donors and non-donors? Why not use a simple random sample from the original
dataset? Given the actual response rate of 5.1%, how will the classification models behave under
simple sampling? In this case, is classification accuracy a good performance metric for our
purposes of maximizing net profit? If not, how would you determine the best model? Explain
your reasoning.
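To make the question concrete, it can help to compare two trivial decision rules at the actual 5.1% response rate. This is only a sketch under stated assumptions: the population size of 100,000 is hypothetical, and every donor is assumed to give exactly $13.00.

```python
RESPONSE_RATE = 0.051
AVG_DONATION = 13.00
MAILING_COST = 0.68
N = 100_000  # hypothetical population size, chosen only to make the numbers concrete

donors = round(N * RESPONSE_RATE)
non_donors = N - donors

# Rule A: predict "non-donor" for everyone, so nothing is mailed.
accuracy_mail_none = non_donors / N
profit_mail_none = 0.0

# Rule B: predict "donor" for everyone, so every person is mailed.
accuracy_mail_all = donors / N
profit_mail_all = donors * AVG_DONATION - N * MAILING_COST

print(f"mail nobody:   accuracy={accuracy_mail_none:.3f} profit={profit_mail_none:.2f}")
print(f"mail everyone: accuracy={accuracy_mail_all:.3f} profit={profit_mail_all:.2f}")
```

Rule A is 94.9% accurate yet identifies no donors and earns nothing, which illustrates why accuracy on an unbalanced sample can be a misleading yardstick when the goal is profit.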
5. Calculate Net Profit
For each method in Questions 2 and 3 (choose the ‘best’ model for each method/technique),
calculate the net profit for both the training and validation set based on the actual
response rate (5.1%). Again, the expected donation, given that they are donors, is $13.00,
and the total cost of each mailing is $0.68. (Hint: to calculate estimated net profit, we will
need to “undo” the effects of the weighted sampling, and calculate the net profit that
would reflect the actual response distribution of 5.1% donors and 94.9% non-donors.)
Draw each model’s net profit lift curve for the validation set onto a single graph. Are
there any models that dominate?
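One possible way to "undo" the weighted sampling is to weight each row of the balanced sample by the ratio of the actual class share to the sampled class share, then accumulate profit as rows are mailed in order of decreasing model score. The sketch below takes that approach; the scores and labels are made up, the function names are hypothetical, and every donor is assumed to give $13.00.

```python
ACTUAL_DONOR_RATE = 0.051
SAMPLE_DONOR_RATE = 0.5
AVG_DONATION = 13.00
MAILING_COST = 0.68

# Each sampled row stands for this many rows at the actual response distribution.
W_DONOR = ACTUAL_DONOR_RATE / SAMPLE_DONOR_RATE                  # 0.102
W_NON_DONOR = (1 - ACTUAL_DONOR_RATE) / (1 - SAMPLE_DONOR_RATE)  # 1.898


def row_profit(is_donor: int) -> float:
    """Reweighted net profit of mailing one row of the balanced sample."""
    if is_donor:
        return W_DONOR * (AVG_DONATION - MAILING_COST)
    return W_NON_DONOR * (-MAILING_COST)


def net_profit_curve(scores, labels):
    """Cumulative reweighted net profit when rows are mailed in order of decreasing score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    curve, total = [], 0.0
    for score, label in ranked:
        total += row_profit(label)
        curve.append((score, total))
    return curve


# Made-up model scores (probability of donor) and true labels, for illustration only.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

curve = net_profit_curve(scores, labels)
best_cutoff, best_profit = max(curve, key=lambda point: point[1])
print(f"best cutoff={best_cutoff:.2f} net profit={best_profit:.4f}")
```

Plotting the cumulative profit against the number of rows mailed gives one net profit lift curve per model. The totals are on the scale of a population the same size as the sample, so they need rescaling to estimate profit for a larger mailing list.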
6. Best Model
Summarize the performance of the ‘best’ model from each method, in terms of Net Profit
from predicting donors in the validation dataset; at what cutoff is the best performance
obtained? You may use the following table as template to assess the models:

Classification Technique  Training Accuracy  Testing Accuracy  Net Profit  Cutoff Value
Decision Trees
K-Nearest Neighbor
Naïve Bayes
Logistic Regression

From your answers above, what do you think is the “best” model? (What criteria do you
use to determine ‘best’?)
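As a reference point for the cutoff column, a break-even cutoff can be derived analytically. The sketch below assumes the model's scores are well-calibrated probabilities on the balanced sample, which often does not hold for trees, k-NN or Naïve Bayes, so an empirical sweep over cutoffs on the validation set may give a different answer. The function names are hypothetical.

```python
ACTUAL_DONOR_RATE = 0.051
SAMPLE_DONOR_RATE = 0.5
AVG_DONATION = 13.00
MAILING_COST = 0.68

W_DONOR = ACTUAL_DONOR_RATE / SAMPLE_DONOR_RATE
W_NON_DONOR = (1 - ACTUAL_DONOR_RATE) / (1 - SAMPLE_DONOR_RATE)


def to_actual_probability(p_sample: float) -> float:
    """Convert a donor probability estimated on the balanced sample to the actual 5.1% scale."""
    donor_mass = p_sample * W_DONOR
    non_donor_mass = (1 - p_sample) * W_NON_DONOR
    return donor_mass / (donor_mass + non_donor_mass)


def break_even_sample_cutoff() -> float:
    """Balanced-sample probability above which the expected net profit of a mailing is positive."""
    gain = W_DONOR * (AVG_DONATION - MAILING_COST)   # reweighted gain from a mailed donor
    loss = W_NON_DONOR * MAILING_COST                # reweighted loss from a mailed non-donor
    return loss / (gain + loss)


cutoff = break_even_sample_cutoff()
print(f"break-even cutoff on balanced-sample scores: {cutoff:.4f}")
print(f"equivalent actual probability: {to_actual_probability(cutoff):.4f}")
```

A balanced-sample score of 0.5 corresponds to the base rate of 5.1%, and the break-even cutoff sits slightly above 0.5 because the break-even response probability (0.68 / 13.00) is slightly above the base rate.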
7. Testing your final model
The file FutureFundraising.xls contains the attributes for future mailing candidates. Using your
“best” model (Q6), apply the model to the data and predict the class (donor or non-donor).
Don’t forget to use the same cutoff you determined for your best model to predict donor/non-donor.
Submit this file (in XLSX or XLS format), with your best model’s predictions (probabilities of
being a donor). Make sure that the output file contains the following attributes:
CONTROLN
Confidence (1)
Confidence (0)
Target_B (prediction)
Further, calculate the total profit of your model on this unseen data. Use the original cost/benefit
figures to do the calculation. Then, write short paragraphs on the following:
Draw conclusions about the performance of your best model in terms of the campaign’s cost/benefit.
Did your analysis turn out to be beneficial? What went right or wrong?
Comment on the applicability of the technique you used. How would it relate to the
structure of the data?
What are some other practical aspects that may increase the applicability of the model, or
aspects that may affect your analysis, in a positive or negative way?
Submit the following:
Your answers to the above questions in a Word document
One Excel document after applying the model in Question 7
METADATA
STATE State abbreviation (a nominal/symbolic field)
DOMAIN DOMAIN/Cluster code (a nominal/symbolic field).
It can be broken down by bytes as explained below.
1st byte = Urbanicity level of the donor’s neighborhood
U=Urban
C=City
S=Suburban
T=Town
R=Rural
2nd byte = Socio-Economic status of the neighborhood
1 = Highest SES
2 = Average SES
3 = Lowest SES (except for Urban communities, where
1 = Highest SES, 2 = Above average SES,
3 = Below average SES, 4 = Lowest SES.)
AGE Overlay Age
0 = missing
HOMEOWNR Home Owner Flag
H = Home owner
U = Unknown
NUMCHLD NUMBER OF CHILDREN
INCOME HOUSEHOLD INCOME
GENDER Gender
M = Male
F = Female
U = Unknown
J = Joint Account, unknown gender
WEALTH1 Wealth Rating
HIT Indicates total number of known times the donor has responded to a mail
order offer other than PVA’s.
IC1 Median Household Income in hundreds
IC2 Median Family Income in hundreds
IC3 Average Household Income in hundreds
IC4 Average Family Income in hundreds
IC5 Per Capita Income
IC6 Percent Households w/ Income < $15,000
IC7 Percent Households w/ Income $15,000 – $24,999
IC8 Percent Households w/ Income $25,000 – $34,999
IC9 Percent Households w/ Income $35,000 – $49,999
IC10 Percent Households w/ Income $50,000 – $74,999
IC11 Percent Households w/ Income $75,000 – $99,999
IC12 Percent Households w/ Income $100,000 – $124,999
IC13 Percent Households w/ Income $125,000 – $149,999
IC14 Percent Households w/ Income >= $150,000


CARDPROM Lifetime number of card promotions received to date. Card promotions are
promotion type FS, GK, TK, SK, NK, XK, UF, UU.
NUMPROM Lifetime number of promotions received to date
RAMNTALL Dollar amount of lifetime gifts to date
NGIFTALL Number of lifetime gifts to date
CARDGIFT Number of lifetime gifts to card promotions to date
MINRAMNT Dollar amount of smallest gift to date
MINRDATE Date associated with the smallest gift to date
MAXRAMNT Dollar amount of largest gift to date
MAXRDATE Date associated with the largest gift to date
LASTGIFT Dollar amount of most recent gift
LASTDATE Date associated with the most recent gift
FISTDATE Date of first gift
NEXTDATE Date of second gift
TIMELAG Number of months between first and second gift
AVGGIFT Average dollar amount of gifts to date
CONTROLN Control number (unique record identifier)
TARGET_B Target Variable: Binary Indicator for Response to 97NK Mailing
TARGET_D Target Variable: Donation Amount (in $) associated with the Response to
97NK Mailing
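Two of the field descriptions above translate directly into preprocessing steps: DOMAIN can be split into its two bytes, and an AGE of 0 should be treated as missing rather than as a real age. The helpers below are a minimal sketch with hypothetical names; how blank values are actually encoded in the CSV is an assumption here.

```python
URBANICITY = {"U": "Urban", "C": "City", "S": "Suburban", "T": "Town", "R": "Rural"}


def split_domain(domain):
    """Split a DOMAIN code such as 'S2' into (urbanicity, SES level); invalid or blank codes give (None, None)."""
    code = (domain or "").strip()
    if len(code) < 2 or code[0] not in URBANICITY or not code[1].isdigit():
        return None, None
    return URBANICITY[code[0]], int(code[1])


def clean_age(age):
    """Map the AGE placeholder 0 (and empty values) to None so it is not treated as a real age."""
    if age is None or str(age).strip() == "":
        return None
    value = int(float(age))
    return None if value == 0 else value


print(split_domain("S2"))
print(clean_age(0), clean_age("67"))
```

Note that the SES digit means different things for Urban codes (four levels) and for all other codes (three levels), so it may be safer to treat the two bytes together as a categorical value than to treat the digit as a common ordinal scale.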