Instructions:

▪ Submit only one file in pdf format to the link on the Study Desk.

▪ Assume that your report will be read by someone familiar with the data sets but with limited statistical knowledge. Fully explain plots and, when stating statistics or results, explain what they mean statistically AND in context of the data.

▪ Presentation should be neat, consistent, spell-checked and proofread. All questions should be clearly labelled, and all answers should clearly and concisely address the questions.

▪ If you convert a Word document to pdf for submission check that all symbols, equations etc. have converted correctly, i.e., proof-read your work.

▪ If you do not use knitr to compile your submission, where asked to provide R code, paste relevant code within the assignment document and italicise (or otherwise highlight or distinguish from other content). Do not include code in an appendix.

▪ Do not include an appendix at all. Any work included in an appendix will not be marked.

▪ Please note that referencing textbooks and other resources is not the goal of this assessment. This work requires students to demonstrate their understanding of the analysis and interpretation, not to provide quotes from resources.

▪ When interpreting output, you are expected to do so in context of the data and the method (i.e. ensure you comment on aspects of the method that affect your interpretation with respect to the variables and sample).

▪ A maximum of 10 marks will be deducted from your total marks for poor presentation.

Marks:

▪ Question 1: 25

▪ Question 2: 30

▪ Question 3: 20

▪ Question 4: 20

▪ Question 5: 5

Question 1 (25 marks):

The data file ‘iris.txt’ contains data measuring four features of iris flowers. One hundred plants across three species were measured for the variables Sepal Length, Sepal Width, Petal Length and Petal Width. Provide R code, output and written interpretation for all analyses.

(a) Produce and interpret pair-wise scatter plots for all four of the flower feature variables, distinguishing between species using colour. (5 marks)

(b) Training and test sets should be used with a 60/40 split and a seed value of 1125 in your code. Use the table function in R to provide the number of flowers in each species for both the training and test sets that you have constructed. (3 marks)

(c) How would increasing the training/test split to 80/20 potentially affect your results? (Do not perform this analysis.) (2 marks)

(d) Perform a discriminant function analysis (DFA) using the training set. Explain why only two discriminant functions (DFs) are calculated. Provide output, definition and interpretation (in context of the data and method) for: (10 marks)

• the prior probabilities

• the trace values

• the weightings on LD1 and LD2

(e) Based on the DFA, predict species membership for the test set and create and interpret a table showing observed vs predicted for the test set. Create an x-y plot of the two DFs grouped by the original species labels and another by the predicted species labels. Indicate on the 2nd plot the flowers that were misclassified. (5 marks)
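
As a rough illustration of the workflow in parts (b), (d) and (e), the sketch below uses R's built-in `iris` data as a stand-in for 'iris.txt'. This is an assumption: the built-in data has 150 rows and its column names may differ from the assignment file. It also assumes `lda()` from the MASS package is the intended DFA function and that the split is drawn with `sample()`; neither is specified in the question. It is a minimal outline, not a model answer.

```r
library(MASS)  # provides lda(); assumed here as the DFA implementation

# Assumption: built-in `iris` stands in for 'iris.txt'
dat <- iris

set.seed(1125)                                  # seed value given in part (b)
n <- nrow(dat)
train_idx <- sample(n, size = round(0.6 * n))   # 60% of rows for training
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]                      # remaining 40% for testing

table(train$Species)   # species counts in the training set
table(test$Species)    # species counts in the test set

# Discriminant function analysis on the training set
fit <- lda(Species ~ ., data = train)
fit$prior                        # prior probabilities
fit$svd^2 / sum(fit$svd^2)       # proportion of trace for each DF
fit$scaling                      # weightings on LD1 and LD2

# Observed vs predicted species for the test set
pred <- predict(fit, newdata = test)
conf_mat <- table(observed = test$Species, predicted = pred$class)
conf_mat
```

Printing `fit` directly shows the same prior probabilities, coefficients and proportion of trace in a single summary; `pred$x` holds the discriminant scores that could be used for the x-y plots.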

Question 2 (30 marks):

The data file ‘butterflies.txt’ contains the butterfly data from Table 1.3 in the textbook by Manly (2005). Sixteen colonies of butterflies were sampled and the data set contains information related to 4 environmental variables and 4 gene frequencies. As described in Manly (2005) the frequencies for the 0.40 and 0.60 genes have been combined to form a new variable labelled ‘0.4+0.6’. Assume multivariate normality (MVN).

(a) Based on standardised variables produce and comment on 3 separate pairwise correlation matrices: 1) correlation between the 4 gene frequency variables; 2) correlation between the 4 environmental variables; 3) correlation between the 4 gene frequency variables and the 4 environmental variables. Do these correlation matrices suggest that canonical correlation would be an appropriate form of analysis and why? (5 marks)

(b) Perform a canonical correlation analysis on this data set for the standardised variables X1 to X4 (Alt, annualprec, maxtemp, mintemp) and Y1 to Y4 (X0.4_0.6, X0.80, X1.00, X1.16) as defined on page 149 of Manly (2005). Provide appropriate output, definitions and interpretations for: (12 marks)

• canonical correlations (also explain why canonical correlations become successively weaker but do not add up to one).

• chi-square test of significance and Rao’s F approximation significance test

• redundancy coefficients for the variance in the Y set of variables explained by the variance in the X set.

[Note: ‘appropriate’ requires you to select the appropriate parts of the output from your analysis to address each dot-point – do not include all R output].

(c) Provide the equations that describe the first canonical function using your analysis solution. Interpret the canonical loadings and the value of the analysis overall. (5 marks)

(d) Provide the output showing the eigenvalues and interpret. Explain the relationship between eigenvalues and canonical correlations. (3 marks)

(e) Why is canonical correlation an appropriate technique for this analysis and not multiple regression or MANOVA? (2 marks)

(f) What are the limitations associated with canonical correlation analysis? (3 marks)
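
The relationship asked about in part (d) can be checked numerically. The sketch below is only an illustration: it uses base R's `cancor()` and the built-in `LifeCycleSavings` data as a stand-in for 'butterflies.txt' (both are assumptions, and the variable sets shown are not the assignment's X1 to X4 and Y1 to Y4). It compares the squared canonical correlations with the eigenvalues of the matrix product built from the correlation matrices.

```r
# Assumption: built-in LifeCycleSavings stands in for 'butterflies.txt'
X <- scale(LifeCycleSavings[, c("pop15", "pop75")])       # standardised X set
Y <- scale(LifeCycleSavings[, c("sr", "dpi", "ddpi")])    # standardised Y set

cc <- cancor(X, Y)
cc$cor                     # canonical correlations, largest first

# Correlation matrices within and between the two sets
Rxx <- cor(X)
Ryy <- cor(Y)
Rxy <- cor(X, Y)

# Eigenvalues of Rxx^-1 Rxy Ryy^-1 Ryx are the squared canonical correlations
M  <- solve(Rxx) %*% Rxy %*% solve(Ryy) %*% t(Rxy)
ev <- sort(Re(eigen(M)$values), decreasing = TRUE)
ev
cc$cor^2                   # should match ev up to rounding error
```

Some software reports eigenvalues on a different scale from the one used here, so it is worth checking the documentation of whichever function produces your output before interpreting them.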

Question 3 (20 marks):

Use the ‘iris_sub.txt’ data file (Caution: not the same file as used in Question 1). Provide R code, output and written interpretation for all analyses.

(a) In R produce a table of sample sizes per species in the dataset. Comment. (3 marks)

(b) Standardise the data and perform a cluster analysis based on Euclidean distances and nearest-neighbour (single) linkage. Plot a dendrogram based on this cluster analysis (label the tips of the dendrogram branches by species). Indicate on the dendrogram where the tree should be cut to produce 3 clusters and describe the cluster membership. (7 marks)

(c) Repeat the analysis in part (b) using Euclidean distance and group average linkage, and then again using Manhattan distance and group average linkage. For the clustering based on Manhattan distance, produce the cutree group membership for 3 clusters. Discuss which cluster analysis produces the ‘best’ result, specifically commenting on: the choice of 3 clusters and the membership of each cluster given the true species designation. (10 marks)
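
The sequence of steps in parts (b) and (c) might be outlined as below. This is a sketch under assumptions: the built-in `iris` data stands in for 'iris_sub.txt' (whose size and column names may differ), and base R's `dist()`, `hclust()`, `rect.hclust()` and `cutree()` are assumed to be the intended tools.

```r
# Assumption: built-in `iris` stands in for 'iris_sub.txt'
dat <- iris
z <- scale(dat[, 1:4])                          # standardise the four measurements

# Euclidean distance with nearest-neighbour (single) linkage
d_euc <- dist(z, method = "euclidean")
hc_single <- hclust(d_euc, method = "single")
plot(hc_single, labels = as.character(dat$Species), cex = 0.5)
rect.hclust(hc_single, k = 3)                   # boxes mark the 3-cluster cut

# Manhattan distance with group average linkage
d_man <- dist(z, method = "manhattan")
hc_avg <- hclust(d_man, method = "average")
groups <- cutree(hc_avg, k = 3)                 # cluster membership for 3 clusters

# Cross-tabulate cluster membership against the true species
table(cluster = groups, species = dat$Species)
```

The Euclidean group average analysis follows the same pattern, with `hclust(d_euc, method = "average")`.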

Question 4 (20 marks):

Use the same ‘iris_sub.txt’ data file as in Question 3. Provide R code, output and written interpretation for all analyses.

(a) Produce a metric 2D MDS ordination plot based on Euclidean distances for the four measurement variables (SEPALLEN, SEPALWID, PETALLEN and PETALWID) and using the SPECIES number as labels in the ordination space. Include in your interpretation of the MDS ordination an interpretation of the Goodness of Fit (GOF) output from the MDS analysis. What happens to the GOF if another dimension is added to the analysis? (5 marks)

(b) Reproduce the ordination plot with the species numbers also coloured by species, i.e. red for species 1, blue for species 2 and dark green for species 3. Hint: Use colors() to find names of colours to use in code. (3 marks)

(c) Reproduce the ordination plot with the species identified by 3 different symbols of your choice. Include a legend on your plot. Hint: Use ?pch to find codes for symbols. (2 marks)

(d) Compare the metric MDS ordination to the cluster analyses performed in Question 3. Comment on the similarities and differences between the methods and compare the results. (4 marks)

(e) Is it possible to determine which variables are most influential on the x ordination axis? Explain. (2 marks)

(f) Rerun the ordination with row labels (plant id) as the labels for objects in the ordination space. Briefly describe the association between Plants 27, 39, 51 and 75. (4 marks)
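
The plotting steps in parts (a) to (c) could look roughly like the sketch below. It assumes base R's `cmdscale()` is the intended metric MDS function and uses the built-in `iris` data as a stand-in for 'iris_sub.txt' (an assumption; the assignment file uses column names such as SEPALLEN and SPECIES). The symbol codes chosen are arbitrary.

```r
# Assumption: built-in `iris` stands in for 'iris_sub.txt'
dat <- iris
d <- dist(dat[, 1:4], method = "euclidean")

mds2 <- cmdscale(d, k = 2, eig = TRUE)   # metric MDS in 2 dimensions
mds2$GOF                                 # two goodness-of-fit measures
mds3 <- cmdscale(d, k = 3, eig = TRUE)   # same analysis with a third dimension
mds3$GOF

sp   <- as.integer(dat$Species)          # species number 1, 2, 3
cols <- c("red", "blue", "darkgreen")    # one colour per species
syms <- c(16, 17, 15)                    # one plotting symbol per species

# Species numbers as labels, coloured by species
plot(mds2$points, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(mds2$points, labels = sp, col = cols[sp])

# Species shown by symbol, with a legend
plot(mds2$points, pch = syms[sp], col = cols[sp],
     xlab = "Dimension 1", ylab = "Dimension 2")
legend("topright", legend = levels(dat$Species), pch = syms, col = cols)
```

For part (f), passing `labels = rownames(dat)` to `text()` would label each point by its row identifier instead of its species number.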

Question 5 (5 marks):

Write 100 to 300 words explaining whether any of these forms of analysis have helped your understanding of the data. Do not restate results.