CSE5BDC - Big Data Computer Engineering Programming Assignment

Objectives

  1. Develop an in-depth understanding of big data tools such as Hive, SparkRDDs, and Spark SQL.
  2. Solve complex big data processing tasks efficiently.
  3. Gain experience processing different types of real data:
     a. Standard multi-attribute data (Bank data)
     b. Time series data (Twitter feed data)
     c. Bag of words data
  4. Practice using programming APIs to find optimal solutions.

ChatGPT and Similar AI Tools

For this assessment task, using ChatGPT, AI tools, or chatbots with similar functionality is strictly prohibited. Breaching this rule will result in standard academic misconduct measures. Students may also be required to provide an oral validation of their understanding of their submitted work.

Expected Quality of Solutions

  • Code efficiency (minimal reading/writing from/into HDFS and reduced data shuffling) will be rewarded with higher marks.
  • The assignment can be completed using the provided docker containers and datasets without running out of memory.
  • Emphasis is on problem-solving rather than formatting output.
  • For Hive queries, solutions using fewer tables are preferred.

Task

1. Analysing Bank Data

Conduct analytics on real data from a Portuguese banking institution, stored in a semicolon (";") delimited format.
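As a starting point, semicolon-delimited records like these can be parsed with a standard CSV reader configured with `delimiter=';'`. The column names below are illustrative only; the real headers come from the provided dataset.

```python
import csv
import io

# Hypothetical sample of semicolon-delimited bank data; the real
# column names and quoting come from the provided dataset.
sample = '"age";"job";"balance"\n"30";"services";"1350"\n"47";"blue-collar";"1506"\n'

reader = csv.reader(io.StringIO(sample), delimiter=';')
header = next(reader)                           # first row holds column names
rows = [dict(zip(header, row)) for row in reader]

print(rows[0]["job"])   # services
```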

2. Analysing Twitter Time Series Data

Perform analytics on real Twitter data stored in a tab ("\t") delimited format.

a) [Spark RDD] Find the single row with the highest count and report the month, count, and hashtag name. Print the result using println.
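The core of this task is a single max over parsed rows. A plain-Python sketch of that logic, using made-up sample rows (in Spark RDD terms, this corresponds to a `max` or `reduce` keyed on the count field):

```python
# Plain-Python sketch of Task 2a; sample (month, count, hashtag)
# values are made up for illustration.
records = [
    ("200812", 31, "jobs"),
    ("200901", 84, "followfriday"),
    ("200902", 17, "tinyurl"),
]

# Take the row with the highest count, then report all three fields.
month, count, hashtag = max(records, key=lambda r: r[1])
print(f"month={month} count={count} hashtag={hashtag}")
```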

b) [Do twice, using Hive and Spark RDD] Find the hashtag name tweeted the most across all months. Report the total number of tweets for that hashtag name.
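The aggregation shape here is: sum counts per hashtag across all months, then take the maximum. A plain-Python sketch with sample data (in Spark this maps to `(hashtag, count)` pairs followed by `reduceByKey`; in Hive, a `GROUP BY` with `SUM`):

```python
from collections import defaultdict

# Plain-Python sketch of Task 2b's aggregation; sample
# (month, count, hashtag) rows are made up for illustration.
records = [
    ("200812", 31, "jobs"),
    ("200901", 84, "followfriday"),
    ("200902", 60, "jobs"),
]

totals = defaultdict(int)
for _, count, hashtag in records:
    totals[hashtag] += count          # reduceByKey / GROUP BY equivalent

top_hashtag = max(totals, key=totals.get)
print(top_hashtag, totals[top_hashtag])   # jobs 91
```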

c) [Spark RDD] Given two months x and y (where y > x), identify the hashtag name with the most increased tweet count from month x to month y.
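One way to frame this: build per-hashtag counts for month x and month y separately, then maximize the difference. The sketch below uses plain Python and sample data; treating a hashtag missing from a month as count 0 is our assumption, not the spec's. In Spark this would be two filtered RDDs joined on hashtag before computing the difference.

```python
# Plain-Python sketch of Task 2c; sample rows and the
# missing-month-means-zero convention are assumptions.
records = [
    ("200901", 10, "jobs"), ("200901", 40, "tinyurl"),
    ("200903", 55, "jobs"), ("200903", 42, "tinyurl"),
]
x, y = "200901", "200903"

# Per-hashtag counts for the two months of interest only.
counts = {m: {} for m in (x, y)}
for month, count, hashtag in records:
    if month in counts:
        counts[month][hashtag] = counts[month].get(hashtag, 0) + count

# Hashtag with the largest increase from month x to month y.
hashtags = set(counts[x]) | set(counts[y])
best = max(hashtags, key=lambda h: counts[y].get(h, 0) - counts[x].get(h, 0))
print(best)   # jobs (increase of 45 vs tinyurl's 2)
```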

3. Indexing Bag of Words Data

Create a partitioned index of words to documents efficiently.
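A minimal sketch of the idea: an inverted index mapping each word to its document IDs, with entries assigned to partitions by a hash of the word (one way to mirror Spark's `partitionBy` before writing one output per partition). The `(docID, word)` pair format, the hash choice, and the partition count here are all assumptions, not from the spec.

```python
import zlib

# Plain-Python sketch of a partitioned inverted index; the input
# format and partitioning scheme are illustrative assumptions.
docwords = [(1, "apple"), (1, "bank"), (2, "apple"), (3, "cider")]
num_partitions = 2

partitions = [{} for _ in range(num_partitions)]
for doc_id, word in docwords:
    # Deterministic hash of the word selects the partition.
    bucket = partitions[zlib.crc32(word.encode()) % num_partitions]
    bucket.setdefault(word, []).append(doc_id)   # word -> list of docIDs

# Merging the partitions recovers the full word -> documents index.
merged = {w: docs for part in partitions for w, docs in part.items()}
print(merged["apple"])   # [1, 2]
```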

4. Creating Co-occurring Words from Bag of Words Data

a) [Spark SQL] Remove rows that reference infrequent words from docwords. Store the resulting DataFrame in Parquet format at frequent_docwords.parquet and in CSV format at "Task 4a-out". A word is infrequent if it appears fewer than 1000 times in the entire corpus.
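The filtering logic can be sketched as: total each word's count over the whole corpus, then keep only rows whose word meets the threshold. In Spark SQL this corresponds to a `groupBy` with a sum, a filter on the total, and a join back to docwords before writing Parquet/CSV. The `(docID, word, count)` layout and the sample threshold data below are assumptions for illustration.

```python
from collections import Counter

# Plain-Python sketch of Task 4a's frequency filter; sample rows
# use a (docID, word, count) layout assumed for illustration.
docwords = [
    (1, "data", 700), (2, "data", 400),   # "data" totals 1100 -> kept
    (1, "rare", 10),  (3, "rare", 5),     # "rare" totals 15   -> dropped
]

totals = Counter()
for _, word, count in docwords:
    totals[word] += count                 # corpus-wide count per word

# Keep only rows whose word appears at least 1000 times overall.
frequent = [row for row in docwords if totals[row[1]] >= 1000]
print(frequent)
```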