Machine Learning on Big Data

Big Data Analytics using ML and Streaming methods

Big Data Analytics using PySpark

Develop one multi-class classifier and one clustering model.
Explain the features and configurations you apply.

Evaluate and visualize the accuracy and performance of each method you apply, and show the working solution for each.
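
As a minimal sketch of the evaluation step for the classifier, the plain-Python example below computes accuracy and macro-averaged F1 from true and predicted labels. The labels and the function names are hypothetical and used only for illustration; in a PySpark pipeline these figures would normally come from the library's own evaluators rather than hand-written code.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = confusion_counts(y_true, y_pred)
    f1_scores = []
    for label in labels:
        tp = counts[(label, label)]
        fp = sum(c for (t, p), c in counts.items() if p == label and t != label)
        fn = sum(c for (t, p), c in counts.items() if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels for a three-class problem (illustration only).
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]

print("accuracy:", round(accuracy(y_true, y_pred), 3))
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

The pair counts returned by confusion_counts can also be plotted as a heat map for the report, which covers the visualization part for the classifier.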

Data Streaming analytics using PySpark

Complete two tasks for data streaming analytics. Include a screenshot of the working solution in the report.

Documentation

Write a scientific report.

Implementation Project

Task 1

Find a data set involving an interesting sequence of symbols: perhaps text, color sequences in images, or event logs from some device. Use word2vec to construct symbol embeddings from it, and explore them through nearest-neighbor analysis.
What interesting structures do the embeddings capture?
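
The nearest-neighbor step can be sketched as follows, assuming the embeddings have already been trained by some word2vec implementation and exported as a mapping from symbol to vector. The toy vectors, the symbol names, and the function names are made up for illustration and are not real word2vec output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbors(embeddings, query, k=3):
    """Return the k symbols most similar to `query`, best first."""
    q = embeddings[query]
    scored = [
        (other, cosine(q, vec))
        for other, vec in embeddings.items()
        if other != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional embeddings for event-log symbols.
toy_embeddings = {
    "login": [0.9, 0.1, 0.0],
    "logout": [0.8, 0.2, 0.1],
    "error": [0.0, 0.9, 0.4],
    "timeout": [0.1, 0.8, 0.5],
}

for symbol, score in nearest_neighbors(toy_embeddings, "login", k=2):
    print(symbol, round(score, 3))
```

Cosine similarity is used here because it compares direction rather than vector length, which is the usual choice for word2vec embeddings. Inspecting the neighbor lists of a few well-understood symbols is a simple way to see what structure the embeddings capture.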

Task 2

Experiment with different discounting methods for estimating the frequency of words in English. In particular, evaluate the degree to which frequencies on short text files (1,000 words, 10,000 words, 100,000 words, and 1,000,000 words) reflect the frequencies over a large text corpus of, say, 10,000,000 words.
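
One possible shape for this experiment, as a sketch in plain Python: estimate word frequencies on a small sample with and without discounting, then measure how far each estimate lies from the frequencies in a larger reference text. Add-k (Lidstone) smoothing stands in here as one simple discounting method among several. The tiny corpora, the function names, and the L1 distance as the comparison measure are all assumptions made for illustration.

```python
from collections import Counter


def mle_frequencies(tokens, vocabulary):
    """Relative frequencies with no discounting (unseen words get 0)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: counts[w] / total for w in vocabulary}


def add_k_frequencies(tokens, vocabulary, k=1.0):
    """Add-k smoothing: every vocabulary word gets k extra pseudo-counts.

    Assumes every token in `tokens` belongs to `vocabulary`, so that the
    estimates sum to 1.
    """
    counts = Counter(tokens)
    total = len(tokens) + k * len(vocabulary)
    return {w: (counts[w] + k) / total for w in vocabulary}


def l1_distance(p, q):
    """Sum of absolute differences between two frequency tables."""
    return sum(abs(p[w] - q[w]) for w in p)


# Tiny hypothetical corpora; real runs would use the sample sizes in the task.
reference = "the cat sat on the mat the dog sat on the log".split()
sample = "the cat sat on the mat".split()
vocab = sorted(set(reference))

ref_freq = mle_frequencies(reference, vocab)
estimates = [
    ("no discounting", mle_frequencies(sample, vocab)),
    ("add-1", add_k_frequencies(sample, vocab, k=1.0)),
]
for name, est in estimates:
    print(name, round(l1_distance(est, ref_freq), 3))
```

Repeating the comparison for each sample size and each discounting method gives a table of distances that can be plotted against sample size in the report.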

Tip: You can use the YouTube video "Mining Big Data with Apache Spark" from Week 2 as an example of how to implement these types of ML models.

Implementation Presentation

The presentation should be based on the report you produce. Follow the marking scheme to see how the presentation is expected to be structured and delivered.