Data Cleaning and Standardization Assignment

Assignment Task

Objective

Clean and standardize entity names (sponsoring government agencies and awarded companies) from SAM.gov (USA) and the E-Procurement Government of India databases.

  • Within these sources, identify the “entities”, i.e. the sponsoring government agency and the awarded company. Candidates need to focus on only two key attributes:
    • The attribute that names the sponsoring government agency, such as the Department of Defense for SAM.gov or the Ministry of Road Transport and Highways for E-Procurement India, i.e. the government agency that is awarding the contract.
    • The supplier or company that was awarded the contract, for example Tata Steel or Bouygues. Note that the main focus is on the company and supplier name. A group such as Tata can have multiple entities: Tata Steel, for example, may appear as Tata Steel Europe, Tata Steel USA, or other such names. The objective is to produce consistent and reliable final names.
  • Please research the above sources in depth to find the right URLs for active tenders and historic tender record datasets.
  • Candidates can show:
    • The manual steps they used to fix a small subsample of 100 records.
    • A basis for automating this using LLMs or Python operations; candidates who provide one and execute on it will receive extra credit.
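
As an illustration of the kind of rule a candidate might apply when cleaning company names, here is a minimal sketch of a deterministic normalizer. The function name `normalize_name` and the `LEGAL_SUFFIXES` list are hypothetical and are not part of the assignment; the suffix list in particular would need to be extended after inspecting the real data.

```python
import re
import unicodedata

# Hypothetical list of legal-form tokens to strip; extend after inspecting real records.
LEGAL_SUFFIXES = {
    "LTD", "LIMITED", "PVT", "PRIVATE", "INC", "INCORPORATED",
    "LLC", "CORP", "CORPORATION", "CO", "COMPANY", "PLC", "SA",
}


def normalize_name(raw: str) -> str:
    """Return an upper-case, punctuation-free name without trailing legal forms."""
    # Fold accented characters to their base letters.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Join dotted abbreviations ("S.A." becomes "SA") and spell out ampersands.
    text = text.upper().replace(".", "").replace("&", " AND ")
    # Replace every remaining run of non-alphanumeric characters with one space.
    text = re.sub(r"[^A-Z0-9]+", " ", text)
    tokens = text.split()
    # Strip legal-form tokens only from the end, and never the last remaining token.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)
```

With these rules, "Tata Steel Ltd." and "TATA STEEL LIMITED" both become "TATA STEEL", while "Tata Steel Europe Ltd" stays distinct as "TATA STEEL EUROPE". Whether subsidiaries should be merged into the parent name is a decision the candidate has to make and document.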

Part 1: Data Cleaning and Standardization

Manually clean and standardize a subset (100 records) of entity names from the provided datasets.

Part 2: Automation Proposal and Script Development

Develop a basic automation script or method using Python and language models (OpenAI API, Llama2, etc.) to standardize entity names in the datasets. 
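
One possible starting point, before any language model is involved, is a string-similarity baseline that groups spelling variants and maps each group to its most frequent form. The sketch below uses only the Python standard library; the function names and the 0.9 threshold are assumptions chosen for illustration, and pairs that score near the threshold could instead be passed to a language model for a decision.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_names(names, threshold=0.9):
    """Greedy clustering: add each name to the first cluster whose first member is similar enough."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster matched, so this name starts a new one.
            clusters.append([name])
    return clusters


def canonical_map(names, threshold=0.9):
    """Map every variant to the most frequent spelling in its cluster."""
    mapping = {}
    for cluster in cluster_names(names, threshold):
        canonical = Counter(cluster).most_common(1)[0][0]
        for variant in cluster:
            mapping[variant] = canonical
    return mapping
```

Greedy clustering depends on input order and compares each name against every cluster, so it suits a 100-record sample better than a full dataset; it is a baseline to measure other methods against, not a recommended final design.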

Part 3: Scalability and Production Readiness

Document how the proposed method can be scaled and implemented in a production environment.

Details:

Include considerations for continuous data updating and processing large volumes of data. Explain how the method adheres to data quality and standards.
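
To make the idea of continuous updating concrete, the following sketch processes records in batches and resolves each distinct raw name only once, keeping the results in a lookup that can be persisted between runs. Everything here is an assumption for illustration: `standardize_stream`, the `known` dictionary, and the caller-supplied `resolve` function (which could wrap a rule set or a language-model call) are hypothetical names.

```python
from itertools import islice


def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def standardize_stream(rows, column, known, resolve, batch_size=1000):
    """Yield each row with an added 'standard_name', calling resolve() once per new raw name.

    rows    -- iterable of dicts, e.g. from csv.DictReader
    column  -- key holding the raw entity name
    known   -- dict of raw name -> standard name, updated in place
    resolve -- caller-supplied function mapping a raw name to its standard form
    """
    for batch in batched(rows, batch_size):
        # Only names not seen in earlier batches or earlier runs need resolving.
        new_names = {row[column] for row in batch if row[column] not in known}
        for name in new_names:
            known[name] = resolve(name)
        for row in batch:
            yield {**row, "standard_name": known[row[column]]}
```

Because `known` is updated in place, saving it after each run means that newly published tenders only trigger work for names that have not been seen before.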

Evaluation Criteria

Standards & Quality: Accuracy and consistency in the final cleaned and standardized data.

Scalability: The potential of the method to handle large datasets efficiently in a production environment.

Documentation: Clarity and comprehensiveness of the documentation, including reasoning for scaling the solution.
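
The brief does not define how accuracy and consistency are measured. One way a candidate might quantify them, sketched here under that assumption, is to treat the 100 manually cleaned records as a reference and compare the automated output against it. The function names and the dictionary layout (raw name mapped to standard name) are hypothetical.

```python
from collections import defaultdict


def accuracy(predicted, gold):
    """Share of raw names whose predicted standard name equals the manual one."""
    hits = sum(1 for raw, name in gold.items() if predicted.get(raw) == name)
    return hits / len(gold)


def inconsistent_entities(predicted, gold):
    """Return manual names whose raw variants received more than one predicted name."""
    seen = defaultdict(set)
    for raw, name in gold.items():
        seen[name].add(predicted.get(raw))
    return {name: labels for name, labels in seen.items() if len(labels) > 1}
```

Accuracy shows how often the automated name matches the manual one exactly, while the second function lists entities that the automation split across several names, which is the inconsistency the assignment asks candidates to remove.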

Deliverables

Candidates should submit a Google Drive folder containing:

  1. Python Scripts: Code for data cleaning and standardization.
  2. Sample Data: Original and final cleaned datasets (100 records minimum).

Documentation

  • Detailed explanation of the methods used.
  • Plan for scaling the solution to a production environment.

Additional Task Details

  • Data Sources: Use SAM.gov and the E-Procurement Government of India for sourcing data.
  • Tools and Languages: Utilize Python, OpenAI API, Llama2, or other open-source language models.
