Data Structures

We have studied about several data structures up to this point (lists, dictionaries, tuples), so now is the time to learn how to best choose one of them to solve a problem at hand.

Random Numbers

Given the same input a program will generate the same output every time. Because of this behavior we say the program is deterministic.

Sometimes however we would like to use nondeterministic values, i.e. pseudorandom. We say pseudo, since they are based on deterministic computations. However, they do appear to be random to the user.

The random module provides functions that generate pseudorandom numbers or simply random numbers for this discussion.

.random()

The function .random() returns a number in the closed-open range [0.0, 1.0), i.e. >= 0.0 but < 1.0.

.randint()

The function .randint() takes two parameters, low and high and returns an integer in the closed range [low, high].

.choice()

To choose an element at random from a sequence, you can use the .choice() function.

.sample()

If you want to select a number of unique elements from a sequence, then you can use .sample(). The second argument specifies how many elements to select from the sample.


Ex 1 - Choose from histogram

Here's another example to consider.

Write a function chooseFromHistogram() that takes a histogram as a parameter and returns a random value from the histogram, chosen with probability in proportion to frequency.

For example if the histogram was: {'a': 2, 'b': 1}, then 'a' should be selected 2/3 of the time and 'b' selected 1/3 of the time.

Hint: You may want to create a list containing each key frequency number of times, i.e. 2 'a's and 1 'b'.

Dictionary Subtraction

Assume you want to find out how many words from the book, emma.txt are not in the word list words.txt. How would you go about it?

If we consider each a set, then this becomes a set subtraction, i.e. all the words in set1 (emma), not in set2 (words).

Let's see the code for this then, first using dictionaries, then using sets.


Ex 1 - Solution - Choose from histogram

Practice problems

Create a separate Python source file (.py) in VSC to complete each exercise.


p1 - Convert to lowercase

Write a program that:

  1. reads a file
  2. breaks each line into words
  3. stips whitespace and punctuation from the words and
  4. converts them to lowercase.

Hint: The string module provides a string named whitespace, which contains space, tab, newline, etc., and punctuation which contains the punctuation characters. These will be helpful in your solution.

Also, you might consider using the string methods strip(), replace() and translate().

Compare your solution to those provided furtheer down in this discussion.

A sample file to use if you'd like is convert.txt


p2 - Word Count

Go to Project Gutenberg and download your favorite out-of-copyright book in plain text format.

Modify your program from the previous exercise to:

  1. read the book you downloaded
  2. skip over the header information at the beginning of the file, and
  3. process the rest the words as before

Then modify the program to:

  1. count the total number of words in the book, and
  2. the number of times each word is used

Print the number of different words used in the book. Compare different books by different authors, writers in different eras. Which author uses the most extensive vocabulary?

Here is a copy of Emma if you want to try it out. Note: I removed all leading/trailing header lines.


p3 - 20 Most Frequent Words

Modify the program from the previous exercise to print the 20 most frequently used words in the book.

[Tuples] [TOC] [Files]