DSCI 4520/5240 Dr. Nick Evangelopoulos
PRACTICE FOR EXAM 1
PART I – MULTIPLE CHOICE QUESTIONS
1. Data mining problems often involve the prediction of a target event shared by a very small fraction of the entire population. Suppose such a proportion is equal to 1%. Then, in order to obtain a sizeable sample corresponding to that target event, we sample 1,000 observations where the target event occurred and 3,000 observations where the target event did not occur.
This sampling technique is called
A. Stratified sampling
B. Separate sampling or oversampling
C. Random sampling
D. Independent sampling
2. Refer to question 1. After fitting a default regression model and a stepwise regression model, you create a Cumulative %Response chart without specifying any prior probability. The %Response value of the baseline at the 20th percentile would be:
A. Equal to the non-cumulative %Response value at the 10th percentile, plus the non-cumulative %Response value at the 20th percentile
B. Equal to 1%
C. Equal to 25%
D. Equal to 33%
3. If we could come up with a “perfect” (“exact”) model, i.e., a model that always assigns a probability of 1 to cases that actually have the target event and a probability of 0 to cases that do not, a %Response chart with 10 bins would show:
A. Always 100% in the first bin
B. Always 50% in the first bin
C. Always 0% in the first bin
D. None of the above
4. The step in the CRISP-DM process that relates to applying sound predictive models to business operations is called:
A. Business understanding
B. Modeling
C. Evaluation
D. Deployment
5. In the SEMMA methodology, comparing different competing models is part of:
A. Sample
B. Modify
C. Model
D. Assess
6. In Neural Network training algorithms, convergence refers to:
A. Selection of the final model
B. Reaching a model that cannot be improved
C. Fitting two similar models
D. The algorithm’s timeout
7. The estimated probability threshold above which you should decide Target = 1 is called:
A. Bayes’ optimal threshold
B. Naïve Bayes’ threshold
C. Maximum probability
D. Probability limit
PART II – PROBLEMS
PROBLEM 1
The data below describe different weather conditions and a corresponding decision on whether to go out and play (the Weather data). Grow the first level of a decision tree (i.e., indicate the root, the first split, and the Play values (target values) in the groups of observations generated by the split), implementing the 1R algorithm. Show (i) your work and (ii) the resulting tree.
Outlook Temp. Humidity Windy Play
Sunny Hot High FALSE No
Sunny Hot High TRUE No
Overcast Hot High FALSE Yes
Rainy Mild High FALSE Yes
Rainy Cool Normal FALSE Yes
Rainy Cool Normal TRUE No
Overcast Cool Normal TRUE Yes
Sunny Mild High FALSE No
Sunny Cool Normal FALSE Yes
Rainy Mild Normal FALSE Yes
Sunny Mild Normal TRUE Yes
Overcast Mild High TRUE Yes
Overcast Hot Normal FALSE Yes
Rainy Mild High TRUE No
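As a cross-check on the hand computation, the 1R scoring can be sketched in a few lines of Python. This is only one plausible way to code it, assuming the standard 1R rule: for each value of an attribute, predict the majority Play class, and charge one error per misclassified row; the attribute with the fewest total errors becomes the root.

```python
from collections import Counter

# Weather data from Problem 1: (Outlook, Temp, Humidity, Windy, Play)
data = [
    ("Sunny", "Hot", "High", "FALSE", "No"),
    ("Sunny", "Hot", "High", "TRUE", "No"),
    ("Overcast", "Hot", "High", "FALSE", "Yes"),
    ("Rainy", "Mild", "High", "FALSE", "Yes"),
    ("Rainy", "Cool", "Normal", "FALSE", "Yes"),
    ("Rainy", "Cool", "Normal", "TRUE", "No"),
    ("Overcast", "Cool", "Normal", "TRUE", "Yes"),
    ("Sunny", "Mild", "High", "FALSE", "No"),
    ("Sunny", "Cool", "Normal", "FALSE", "Yes"),
    ("Rainy", "Mild", "Normal", "FALSE", "Yes"),
    ("Sunny", "Mild", "Normal", "TRUE", "Yes"),
    ("Overcast", "Mild", "High", "TRUE", "Yes"),
    ("Overcast", "Hot", "Normal", "FALSE", "Yes"),
    ("Rainy", "Mild", "High", "TRUE", "No"),
]
attributes = ["Outlook", "Temp", "Humidity", "Windy"]

def one_r_errors(rows, attr_index):
    """Total 1R errors for one attribute: for each attribute value,
    predict the majority Play class and count the misclassified rows."""
    errors = 0
    for value in set(row[attr_index] for row in rows):
        labels = Counter(row[-1] for row in rows if row[attr_index] == value)
        errors += sum(labels.values()) - max(labels.values())
    return errors

# The attribute with the fewest total errors is chosen as the root split
for i, name in enumerate(attributes):
    print(f"{name}: {one_r_errors(data, i)}/{len(data)} errors")
```

Running the sketch ranks the four candidate attributes by their error counts; note that 1R as usually stated breaks ties arbitrarily, so your hand-built tree should report the counts, not just the winner.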
PROBLEM 2
Refer to Problem 1, Weather data. Grow another decision tree using Information Gain as the attribute selection criterion. Show the first level only, i.e., the tree’s root, the first split, and the resulting groups of Play values. Show (i) your work and (ii) the resulting tree. The table below may help you with some of your −p log2(p) calculations.
p      −p log2(p)
0 0
1 0
1/2 0.5
1/3 0.528321
2/3 0.389975
1/4 0.5
3/4 0.311278
1/5 0.464386
2/5 0.528771
3/5 0.442179
4/5 0.257542
1/6 0.430827
5/6 0.219195
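The Information Gain computation can likewise be cross-checked with a short Python sketch (a check on your entropy arithmetic, not a substitute for showing the work; the row list repeats the Problem 1 table):

```python
from collections import Counter
from math import log2

# Weather data from Problem 1: (Outlook, Temp, Humidity, Windy, Play)
data = [
    ("Sunny", "Hot", "High", "FALSE", "No"),
    ("Sunny", "Hot", "High", "TRUE", "No"),
    ("Overcast", "Hot", "High", "FALSE", "Yes"),
    ("Rainy", "Mild", "High", "FALSE", "Yes"),
    ("Rainy", "Cool", "Normal", "FALSE", "Yes"),
    ("Rainy", "Cool", "Normal", "TRUE", "No"),
    ("Overcast", "Cool", "Normal", "TRUE", "Yes"),
    ("Sunny", "Mild", "High", "FALSE", "No"),
    ("Sunny", "Cool", "Normal", "FALSE", "Yes"),
    ("Rainy", "Mild", "Normal", "FALSE", "Yes"),
    ("Sunny", "Mild", "Normal", "TRUE", "Yes"),
    ("Overcast", "Mild", "High", "TRUE", "Yes"),
    ("Overcast", "Hot", "Normal", "FALSE", "Yes"),
    ("Rainy", "Mild", "High", "TRUE", "No"),
]
attributes = ["Outlook", "Temp", "Humidity", "Windy"]

def entropy(labels):
    """Entropy of a class distribution: the sum of the -p*log2(p) terms
    tabulated above."""
    n = sum(labels.values())
    return sum(-(c / n) * log2(c / n) for c in labels.values() if c > 0)

def info_gain(rows, attr_index):
    """Entropy of the full set minus the size-weighted entropy of the
    subsets produced by splitting on the attribute."""
    base = entropy(Counter(row[-1] for row in rows))
    split = 0.0
    for value in set(row[attr_index] for row in rows):
        subset = [row for row in rows if row[attr_index] == value]
        weight = len(subset) / len(rows)
        split += weight * entropy(Counter(row[-1] for row in subset))
    return base - split

# The attribute with the largest gain is chosen as the root split
for i, name in enumerate(attributes):
    print(f"Gain({name}) = {info_gain(data, i):.3f}")
```

The attribute with the largest printed gain is the root of the first-level tree; in your written answer, show the base entropy of the 9-Yes/5-No target column and the weighted entropies of each split explicitly.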
MULTIPLE CHOICE KEY: 1-B, 2-C, 3-D, 4-D, 5-D, 6-B, 7-A