An Overview of this Assignment
This assignment focuses on Weka and Benford’s Law. Before you work on this assignment, please:
- Read the lecture content in Week 8 (Decision Tree), Week 9 (Weka and neural network), and Week 10 (Benford’s Law).
- Read the doc (“Using Excel to Conduct a Benford’s Law Analysis”) under Week 10.
Answer any THREE of the FOUR questions; you are not required to answer all of them. Include your answers to the three questions in one single MS Word file. Enjoy.
Question 1: Association Rules and Decision Tree
The HR Department has collected data related to employee turnover. It wants to identify association rules that show what types of employees are likely to resign. The data is as follows:
Variables | Category | Count | Stay | Resign |
Salary | Low | 50 | 25 | 25 |
Salary | Average | 90 | 10 | 80 |
Salary | High | 160 | 5 | 155 |
With a Master’s Degree | Yes | 85 | 65 | 20 |
With a Master’s Degree | No | 215 | 30 | 185 |
Interpersonal Relationship | Bad | 120 | 35 | 85 |
Interpersonal Relationship | Average | 85 | 30 | 55 |
Interpersonal Relationship | Good | 95 | 3 | 92 |
Promotion since 2017 | Yes | 120 | 30 | 90 |
Promotion since 2017 | No | 180 | 40 | 140 |
(1a) Copy the above table to your Microsoft Word file. Add two columns: Pr(Resign) and Pr(Stay). The first new column contains the probability of RESIGN, and the second new column contains the probability of STAY.
(1b) Calculate the entropy for each of the above variables. Which of the above factors is the most informative for telling the Human Resource Manager whether an employee will stay or go? Provide your justification. Show the steps of your calculation (if any).
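As a way to check hand calculations for (1b), the following is a minimal sketch of one common reading of "entropy for a variable": the weighted average of the Shannon entropy of the stay/resign counts in each category, where a lower value indicates a more informative variable. The function names and the counts are hypothetical illustrations and are not taken from the table above; if the lecture notes define the measure differently, follow the lecture notes.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    # Zero counts are skipped because p * log2(p) tends to 0 as p -> 0.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def split_entropy(groups):
    """Weighted average entropy after splitting on one variable.

    `groups` maps each category to its [stay, resign] counts.
    """
    n = sum(sum(counts) for counts in groups.values())
    return sum(sum(counts) / n * entropy(counts) for counts in groups.values())

# Hypothetical counts, NOT the assignment data: [stay, resign] per category.
toy = {"Yes": [30, 10], "No": [20, 40]}
print(round(split_entropy(toy), 4))
```

The same two functions can be applied to each variable in turn, and the resulting values compared.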
(1c) The Human Resource Manager developed an association rule:
“If the employee does not have a master’s degree, then the employee will resign.”
What is the confidence of the above rule? What is the support?
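For reference, the sketch below shows the usual definitions for a rule "if A then B": support is the share of all records that satisfy both A and B, and confidence is the share of records satisfying A that also satisfy B. Some texts define support differently, so check the lecture's definition. The function name and the counts are hypothetical and are not the assignment data.

```python
def rule_metrics(n_total, n_antecedent, n_both):
    """Return (support, confidence) for a rule "if A then B".

    n_total      -- number of records in the dataset
    n_antecedent -- number of records where A holds
    n_both       -- number of records where both A and B hold
    """
    support = n_both / n_total
    confidence = n_both / n_antecedent
    return support, confidence

# Hypothetical counts, NOT the assignment data.
support, confidence = rule_metrics(n_total=200, n_antecedent=80, n_both=60)
print(f"support={support:.2f}, confidence={confidence:.2f}")
# prints: support=0.30, confidence=0.75
```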
Question 2: Weka and Decision Tree
When you save the data files, use lowercase for file extensions, e.g. use “.arff” not “.ARFF”. Otherwise, the software packages cannot recognize the files.
This ARFF file is similar to the dataset provided in the decision tree example that we did in Lecture 9. You are asked to build a decision tree that uses the iris length and width measurements to predict the type of iris. The data file indicates the independent and dependent variables. The last variable ‘class’ is the dependent variable, and the rest are independent variables.
Use Decision Tree (Weka) to analyze this dataset. Specifically, build a decision tree with a minimum of FIVE instances (i.e. cases) in each leaf. Also, please enable sub-tree raising and pruning (i.e. set “unpruned” to FALSE).
Provide the following in your answers:
- Include a screen capture of the tree output (i.e. use the ‘visualize tree structure’ function to display the tree, then paste a screen capture of it into your answer).
- Give three examples of association rules identified in your tree.
- Write down the overall predictive accuracy of your decision tree.
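Weka reports the overall accuracy directly in its classifier output. To see where that figure comes from, the sketch below computes accuracy as the share of instances on the diagonal of a confusion matrix. The matrix is hypothetical and does not come from the assignment dataset.

```python
def accuracy(confusion):
    """Fraction of instances on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted).
toy_matrix = [[10, 0, 0],
              [0, 8, 2],
              [0, 1, 9]]
print(f"{accuracy(toy_matrix):.1%}")  # 27 of 30 correct, prints 90.0%
```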
Note 1: In the past, some students referred to assignment answers from previous semesters and used the wrong dataset. Answers based on the wrong dataset will receive zero marks for this question.
Question 3: Weka and Neural Network
When you save the data files, use lowercase for file extensions, e.g. use “.arff” not “.ARFF”. Otherwise, the software packages cannot recognize the files.
The ARFF file indicates the independent and dependent variables. Use Neural Network (Weka) to analyze this dataset. Specifically, build a neural network with TWO hidden layers: TWO neurons in the first hidden layer and THREE neurons in the second hidden layer.
In your answer, include a screen capture of the Weka output. Then draw a neural network with the coefficients on the connection links to show what the resulting neural network model looks like.
Note 1: One simple way to draw a neural network diagram is to use the circle and line shapes in Microsoft Word. Some students find it easier to draw the diagram in Microsoft PowerPoint, screen capture it, and include it in a Word document.
Note 2: To increase readability, round the coefficients to 2 decimal places.
Note 3: In the past, some students referred to assignment answers from previous semesters and used the wrong dataset or created a network with the wrong structure. Such answers will receive zero marks for this question.
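To make the link between the diagram and the model concrete, the sketch below shows how the coefficients on the connection links combine in a network with two hidden layers of two and three neurons. It assumes sigmoid units throughout, two inputs, and one output; the real numbers of inputs and outputs depend on the dataset, and every coefficient here is hypothetical rather than Weka output.

```python
from math import exp

def sigmoid(x):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-x))

def layer(inputs, weights, biases):
    """One fully connected sigmoid layer.

    weights[j][i] is the coefficient on the link from input i to neuron j;
    biases[j] is the threshold (bias) term of neuron j.
    """
    return [sigmoid(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]

# Hypothetical coefficients: 2 inputs -> 2 neurons -> 3 neurons -> 1 output.
x = [0.5, -1.0]
h1 = layer(x, [[0.10, -0.20], [0.30, 0.40]], [0.00, 0.10])
h2 = layer(h1, [[0.50, -0.50], [0.25, 0.75], [-1.00, 1.00]], [0.00, 0.00, 0.20])
out = layer(h2, [[1.00, -1.00, 0.50]], [0.00])
print([round(v, 2) for v in h1])
print([round(v, 2) for v in h2])
print([round(v, 2) for v in out])
```

Each row of a weight list corresponds to one circle in the diagram, and each number in that row to one incoming line.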
Question 4: Benford’s Law
This file contains only one column—the amounts of 2,086 reimbursements.
In this company, employees do not need to seek approval to get a reimbursement with an amount less than $Y. However, they have to provide receipts and write a justification report to the Head of Department to get a reimbursement with an amount greater than or equal to $Y. Your tasks are:
(4a) Use Benford’s Law to plot a chart that shows the distribution of the first two leading digits of all reimbursement amounts.
(4b) Identify the anomaly (if any).
(4c) Find the exact value of $Y.
To answer this question, you are expected to use MS Excel to generate one single chart that includes the observed distribution of the 2,086 reimbursements and the correct Benford’s Law distribution for two leading digits.
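The chart itself should be produced in MS Excel as stated. As an optional cross-check of the spreadsheet figures, the sketch below computes the expected Benford share for each leading two-digit pair d from 10 to 99, which is log10(1 + 1/d), and the observed share in a list of amounts. The function names and the sample amounts are hypothetical and are not the assignment data.

```python
from collections import Counter
from math import log10

def benford_two_digit():
    """Expected probability of each leading two-digit pair 10..99."""
    return {d: log10(1 + 1 / d) for d in range(10, 100)}

def first_two_digits(amount):
    """Leading two significant digits of a positive amount, e.g. 4.75 -> 47."""
    digits = f"{amount:.10e}"  # scientific notation, e.g. '4.7500000000e+00'
    return int(digits[0] + digits[2])

def observed_share(amounts):
    """Observed proportion of each leading two-digit pair 10..99."""
    counts = Counter(first_two_digits(a) for a in amounts if a > 0)
    n = sum(counts.values())
    return {d: counts.get(d, 0) / n for d in range(10, 100)}

# Hypothetical amounts, NOT the assignment data.
sample = [12.50, 129.99, 0.47, 4700, 13.00, 98.10]
expected = benford_two_digit()
observed = observed_share(sample)
for d in (12, 13, 47, 98):
    print(d, round(expected[d], 4), round(observed[d], 4))
```

Plotting the expected and observed shares side by side shows which digit pairs depart from the Benford distribution.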