How To Create a Regex For Text Analysis With Examples Assignment

Assignment Task

Problem

Write regular expressions for the following cases (a sketch of possible patterns is given after the list):

  1. Match one or more words (only with lowercase alphabets) separated by spaces e.g., "red blue green white"
  2. Match title case sentences, assuming all words are capitalized. Note that the text can contain numbers and punctuation. e.g., "Why Sleep Is So Important To Your Health?"
  3. Match strings that contain the word "ice", without matching phrases that contain "ice" such as "ice cream" or "ice bucket"
  4. Textbook Exercises 2.1-(2) match the set of all lowercase alphabetic strings ending in a "b"
  5. Textbook Exercises 2.2-(2) match all strings that start at the beginning of the line with an integer and that end at the end of the line with a word
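
A minimal sketch of one possible set of patterns, using Python's re module. The patterns for cases 2 and 3 are assumptions, since those task statements leave some room for interpretation; other correct answers exist.

```python
import re

# 1. One or more lowercase words separated by single spaces
p1 = re.compile(r"^[a-z]+( [a-z]+)*$")

# 2. Title-case sentence: every word starts with a capital letter and may
#    contain letters, digits, and trailing punctuation (an assumption).
p2 = re.compile(r"^(?:[A-Z][A-Za-z0-9]*[^\w\s]*\s?)+$")

# 3. The word "ice" on its own, not followed by "cream" or "bucket"
#    (interpreting the examples as compounds to exclude).
p3 = re.compile(r"\bice\b(?!\s+(?:cream|bucket)\b)")

# 4. Lowercase alphabetic strings ending in "b" (Exercise 2.1-(2))
p4 = re.compile(r"^[a-z]*b$")

# 5. Line starting with an integer and ending with a word (Exercise 2.2-(2))
p5 = re.compile(r"^\d+\b.*\b[A-Za-z]+$")

print(bool(p1.match("red blue green white")))                            # True
print(bool(p2.match("Why Sleep Is So Important To Your Health?")))      # True
print(bool(p3.search("the ice melted")), bool(p3.search("ice cream")))  # True False
print(bool(p4.match("absorb")))                                          # True
print(bool(p5.match("35 boxes of books")))                               # True
```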

Problem

We will only use the language data in the "messages" field of the JSON records.

First, read the dataset (train.jsonl) and extract only the "messages" using a JSON parser. Write a new file ("data.txt") that contains one message per line. We will use data.txt from now on.
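
A minimal sketch of this extraction step, assuming each line of train.jsonl is a JSON object whose "messages" field holds a single string (if it is a list, write each element on its own line instead):

```python
import json

with open("train.jsonl", encoding="utf-8") as fin, \
        open("data.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: "messages" holds one string of text per record.
        message = record.get("messages", "")
        if message:
            fout.write(message + "\n")
```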

Use the NLTK package to split the data into sentences. Use the sent_tokenize() function.
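
Sentence splitting could then look like the following sketch (the punkt tokenizer model needs to be downloaded once):

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the sentence tokenizer model

sentences = []
with open("data.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:                  # ignore empty lines
            continue
        for sent in sent_tokenize(line):
            if sent.strip():          # ignore empty sentences
                sentences.append(sent)

print(len(sentences))                 # sentence count, ignoring empties
```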

  1. How many sentences are there? Note that you need to ignore empty lines and empty sentences.
  2. Now let's find words. First, split the sentences using Python's split(' ') function.
  3. How many tokens are there? This time, use NLTK's word_tokenize() function to split into words.
  4. How many tokens are there now? Lowercase all the words.
  5. How many tokens and types now? Compare the number of tokens from (2), (3), and (4). Why are they different?
    Lastly, make a dictionary of word type counts. The dictionary contains word type as its key, and frequency as its value. Sort the dictionary by frequency (a sketch of these steps is given after the list).
  6. What is the most frequent word type?
  7. What is the 10th most frequent word type?
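
A sketch of one way to carry out steps 2 through 7, building on the sentences list from the sketch above:

```python
from collections import Counter
from nltk.tokenize import word_tokenize

# (2) whitespace splitting with split(' '), dropping empty strings
split_tokens = [tok for sent in sentences for tok in sent.split(' ') if tok]
print("split(' ') tokens:", len(split_tokens))

# (3) NLTK word_tokenize()
nltk_tokens = [tok for sent in sentences for tok in word_tokenize(sent)]
print("word_tokenize tokens:", len(nltk_tokens))

# (4) lowercase all tokens
lower_tokens = [tok.lower() for tok in nltk_tokens]
print("lowercased tokens:", len(lower_tokens))

# (5) word type counts, sorted by frequency
type_counts = Counter(lower_tokens)
sorted_counts = sorted(type_counts.items(), key=lambda kv: kv[1], reverse=True)
print("types:", len(type_counts))

# (6) most frequent and (7) 10th most frequent word type
print("most frequent:", sorted_counts[0])
print("10th most frequent:", sorted_counts[9])
```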

Problem

For this problem, we will use Python 3.

First, we are going to download the Brown corpus using NLTK:

import nltk

nltk.download('brown')

from nltk.corpus import brown

news_data = brown.sents(categories='news')

romance_data = brown.sents(categories='romance')

Note that the texts are already split into sentences and also are tokenized.

Write a program to compute unsmoothed unigram and bigram models. Before doing that, you need to lowercase all the words. In addition, you need to include <s> and </s> before and after each sentence. You can use the nltk.util package, but don't use the nltk.lm package, which means you need to write your own functions to create vocabularies, calculate MLE, etc.
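
A minimal sketch of the counting and MLE steps under these constraints; the helper functions below (prepare, count_ngrams, unigram_mle, bigram_mle) are our own names, not part of NLTK:

```python
from collections import Counter
from nltk.corpus import brown
from nltk.util import bigrams  # nltk.util is allowed; nltk.lm is not used

def prepare(sentences):
    """Lowercase every word and wrap each sentence in <s> ... </s> markers."""
    return [["<s>"] + [w.lower() for w in sent] + ["</s>"] for sent in sentences]

def count_ngrams(sentences):
    """Return unigram and bigram Counters for a list of prepared sentences."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        unigram_counts.update(sent)
        bigram_counts.update(bigrams(sent))
    return unigram_counts, bigram_counts

def unigram_mle(unigram_counts):
    """P(w) = count(w) / total number of tokens."""
    total = sum(unigram_counts.values())
    return {w: c / total for w, c in unigram_counts.items()}

def bigram_mle(unigram_counts, bigram_counts):
    """P(w_t | w_{t-1}) = count(w_{t-1} w_t) / count(w_{t-1})."""
    return {(w1, w2): c / unigram_counts[w1]
            for (w1, w2), c in bigram_counts.items()}

news = prepare(brown.sents(categories="news"))
romance = prepare(brown.sents(categories="romance"))

news_uni, news_bi = count_ngrams(news)
rom_uni, rom_bi = count_ngrams(romance)

# Distinct (non-zero) unigrams and bigrams per corpus
print("news:", len(news_uni), "unigrams,", len(news_bi), "bigrams")
print("romance:", len(rom_uni), "unigrams,", len(rom_bi), "bigrams")

# Ten most common n-grams (pair these with the MLE dictionaries above)
print(news_uni.most_common(10))
print(news_bi.most_common(10))
```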

Run your program on the news data and the romance data. Now compare the statistics of the two corpora.

  1. How many non-zero unigrams (in terms of counts) did you get for each corpus?
  2. How many non-zero bigrams (in terms of counts) did you get for each corpus?
  3. List the 10 most common unigrams (in terms of counts) from each dataset with their probabilities P(w_t) (using MLE). You can create a table to show the numbers. Are there any interesting differences between the two?
  4. List the 10 most common bigrams (in terms of counts) from each dataset with their probabilities P(w_t | w_{t-1}) (using MLE). You can create a table to show the numbers. Are there any interesting differences between the two?
    Write a function to compute the probability of a given sentence using an n-gram model you built above (see the sketch after this list).
  5. What is the probability for "I loved her when she laughed" when using the Bigram model from the news data?
  6. What is the probability for "I loved her when she laughed" when using the Bigram model from the romance data?
    Add an option to your program to do add-one smoothing (also covered in the sketch after this list).
  7. After applying add-one smoothing to your bigram models, what are the probabilities for "I loved her when she laughed" when using the model from the news data and the model from the romance data, respectively?
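
One possible shape for the sentence-probability function and the add-one option, building on the counters from the earlier sketch; tokenizing the test sentence with word_tokenize() is an assumption:

```python
from nltk.tokenize import word_tokenize

def sentence_probability(sentence, unigram_counts, bigram_counts, add_one=False):
    """Bigram probability of a sentence under MLE or add-one smoothing."""
    tokens = ["<s>"] + [w.lower() for w in word_tokenize(sentence)] + ["</s>"]
    vocab_size = len(unigram_counts)
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        bigram_count = bigram_counts.get((w1, w2), 0)
        history_count = unigram_counts.get(w1, 0)
        if add_one:
            prob *= (bigram_count + 1) / (history_count + vocab_size)
        else:
            if bigram_count == 0:
                return 0.0  # unseen bigram: MLE probability is zero
            prob *= bigram_count / history_count
    return prob

sent = "I loved her when she laughed"
print("news, MLE:       ", sentence_probability(sent, news_uni, news_bi))
print("romance, MLE:    ", sentence_probability(sent, rom_uni, rom_bi))
print("news, add-one:   ", sentence_probability(sent, news_uni, news_bi, add_one=True))
print("romance, add-one:", sentence_probability(sent, rom_uni, rom_bi, add_one=True))
```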