Assignment Task: Big Data Computer Engineering Programming Assignment
Objectives
ChatGPT and Similar AI Tools
For this assessment task, using ChatGPT, AI tools, or chatbots with similar functionality is strictly prohibited. Breaching this rule will result in standard academic misconduct measures. Students may also be required to provide an oral validation of their understanding of their submitted work.
Expected Quality of Solutions
Task
1. Analysing Bank Data
Conduct analytics on real data from a Portuguese banking institution stored in a semicolon (“;”) delimited format.
2. Analysing Twitter Time Series Data
Perform analytics on real Twitter data stored in a tab (“\t”) delimited format.
a) [Spark RDD] Find the single row with the highest count and report the month, count, and hashtag name. Print the result using println.
b) [Do twice, using Hive and Spark RDD] Find the hashtag name tweeted the most across all months. Report the total number of tweets for that hashtag name.
c) [Spark RDD] Given two months x and y (where y > x), identify the hashtag name whose tweet count increased the most from month x to month y.
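The RDD operations that parts a) to c) call for can be sketched as follows. This is an illustrative outline under stated assumptions, not a model solution. The record layout (month, count, hashtag name, in that order), the `Record` case class, the object and method names, and the in-memory sample rows are all hypothetical; adjust the field indices to the real file. For part c) the sketch assumes that only hashtags present in both months are compared.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TwitterSketch {
  // Hypothetical record layout: month <TAB> count <TAB> hashtag name.
  final case class Record(month: String, count: Long, hashtag: String)

  def parse(line: String): Record = {
    val f = line.split("\t")
    Record(f(0), f(1).toLong, f(2))
  }

  // Part a): the single row with the highest count.
  def highestRow(rows: RDD[Record]): Record =
    rows.reduce((a, b) => if (a.count >= b.count) a else b)

  // Part b): sum counts per hashtag over all months, then keep the largest total.
  def topHashtag(rows: RDD[Record]): (String, Long) =
    rows.map(r => (r.hashtag, r.count))
      .reduceByKey(_ + _)
      .reduce((a, b) => if (a._2 >= b._2) a else b)

  // Part c): (hashtag, count in x, count in y) with the largest increase.
  // Assumption: a hashtag missing from either month is ignored (inner join).
  def biggestIncrease(rows: RDD[Record], x: String, y: String): (String, Long, Long) = {
    val inX = rows.filter(_.month == x).map(r => (r.hashtag, r.count)).reduceByKey(_ + _)
    val inY = rows.filter(_.month == y).map(r => (r.hashtag, r.count)).reduceByKey(_ + _)
    inX.join(inY)
      .map { case (tag, (cx, cy)) => (tag, cx, cy) }
      .reduce((a, b) => if (a._3 - a._2 >= b._3 - b._2) a else b)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TwitterSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Made-up sample standing in for the real tab-delimited file.
    val sample = Seq(
      "200901\t5\tspark",
      "200901\t10\thive",
      "200902\t9\tspark",
      "200902\t12\thive"
    )
    val rows = sc.parallelize(sample).map(parse).cache()

    val best = highestRow(rows)
    println(s"month: ${best.month}, count: ${best.count}, hashtagName: ${best.hashtag}")

    val (tag, total) = topHashtag(rows)
    println(s"hashtagName: $tag, total tweets: $total")

    val (grown, cx, cy) = biggestIncrease(rows, "200901", "200902")
    println(s"hashtagName: $grown, countX: $cx, countY: $cy")

    sc.stop()
  }
}
```

Each part reduces to one pass that keeps a running maximum, so nothing needs to be sorted or collected to the driver. On the real data, `sc.textFile(...)` would replace `sc.parallelize(sample)`.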
3. Indexing Bag of Words Data
Create a partitioned index of words to documents efficiently.
4. Creating Co-occurring Words from Bag of Words Data
a) [Spark SQL] Remove rows referencing infrequent words from docwords. Store the resulting dataframe in Parquet format at frequent_docwords.parquet and in CSV format at “Task 4a-out”. A word is infrequent if it appears fewer than 1000 times in the entire corpus.
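One possible shape for this filter, using the DataFrame API from the Spark SQL module, is sketched below. It is a sketch under assumptions rather than a reference solution. The column names (`docId`, `vocabId`, `count`), the object and method names, and the sample rows are hypothetical. The sketch also assumes that "appears" means the sum of a word's per-document counts over the whole corpus.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object FrequentDocwords {
  // Keeps only rows whose word has a corpus-wide total of at least minTotal.
  // Assumed columns: docId, vocabId, count (hypothetical names).
  def keepFrequent(docwords: DataFrame, minTotal: Long = 1000L): DataFrame = {
    val frequentWords = docwords
      .groupBy("vocabId")
      .agg(sum("count").as("total"))
      .filter(col("total") >= minTotal)
      .select("vocabId")
    // The inner join drops every row whose word is not in frequentWords.
    docwords.join(frequentWords, Seq("vocabId"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FrequentDocwords")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample: word 10 totals 1100 (kept), word 11 totals 3 (removed).
    val docwords = Seq((1, 10, 600), (2, 10, 500), (1, 11, 3))
      .toDF("docId", "vocabId", "count")

    val frequent = keepFrequent(docwords)
    frequent.write.mode("overwrite").parquet("frequent_docwords.parquet")
    frequent.write.mode("overwrite").csv("Task 4a-out")

    spark.stop()
  }
}
```

Aggregating first and joining back keeps the per-document rows intact while the threshold is applied to corpus-wide totals.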