We have studied several data structures up to this point (lists, dictionaries, tuples), so now is the time to learn how to choose the best one for the problem at hand.
Given the same input, a program will generate the same output every time. Because of this behavior, we say the program is deterministic.
Sometimes, however, we would like to use nondeterministic values, i.e. pseudorandom ones. We say pseudo because they are produced by deterministic computations; to the user, however, they appear random.
The random module provides functions that generate pseudorandom numbers, which we will simply call random numbers in this discussion.
The function .random() returns a number in the closed-open range [0.0, 1.0), i.e. >= 0.0 but < 1.0.
import random
for _ in range(5):
    print(random.random())
0.4026686653499624 0.491781417388341 0.43643160454313823 0.6269806578535505 0.8487585900005747
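Because the values are pseudorandom, the sequence can be reproduced by seeding the generator. The sketch below is a minimal illustration; the seed value 42 and the names first and second are arbitrary choices made for this example.

```python
import random

# Seeding fixes the generator's starting state, so the same sequence follows.
random.seed(42)
first = [random.random() for _ in range(3)]

# Re-seeding with the same value restarts the sequence from the same point.
random.seed(42)
second = [random.random() for _ in range(3)]

print(first == second)  # True: same seed, same sequence
```

Without a call to seed(), Python picks a starting state for you, which is why the numbers normally differ from run to run.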
The function .randint() takes two parameters, low and high, and returns an integer in the closed range [low, high].
for _ in range(5):
    print(random.randint(1, 5))
4 1 5 4 2
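Note that the closed range differs from the usual Python convention of excluding the upper bound. As a rough check, the sketch below collects many draws and contrasts .randint() with .randrange(), which follows range() conventions. The names draws and halfOpen and the count of 1000 are assumptions made for this example.

```python
import random

# Draw many values and record which distinct ones show up.
draws = [random.randint(1, 5) for _ in range(1000)]
print(sorted(set(draws)))

# randrange() follows range() conventions instead: the stop value is excluded.
halfOpen = [random.randrange(1, 5) for _ in range(1000)]
print(max(halfOpen) <= 4)  # True: 5 is never produced
```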
To choose an element at random from a sequence, you can use the .choice() function.
l = [1, 2, 3, 4, 5]
for _ in range(5):
    print(random.choice(l))
5 1 3 1 1
If you want to select a number of unique elements from a sequence, then you can use .sample(). The second argument specifies how many elements to select from the sequence.
print(random.sample(l, 2))
[3, 4]
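Two edge cases of .sample() are worth seeing. The sketch below, using the same list l, shows that a full-length sample is simply a reordering of the sequence and that asking for too many elements raises a ValueError. The name everything is an assumption made for this example.

```python
import random

l = [1, 2, 3, 4, 5]

# Every position is used at most once, so a full-length sample is a shuffle.
everything = random.sample(l, len(l))
print(sorted(everything) == l)  # True

# Asking for more elements than the sequence holds is an error.
try:
    random.sample(l, len(l) + 1)
except ValueError as e:
    print('ValueError:', e)
```

Uniqueness here refers to positions, not values: if the sequence itself contains duplicates, the sample can contain them too.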
Here's another example to consider.
Write a function chooseFromHistogram() that takes a histogram as a parameter and returns a random value from the histogram, chosen with probability in proportion to frequency.
For example, if the histogram was {'a': 2, 'b': 1}, then 'a' should be selected 2/3 of the time and 'b' selected 1/3 of the time.
Hint: You may want to create a list containing each key repeated frequency times, i.e. 2 'a's and 1 'b'.
Assume you want to find out how many words from the book emma.txt are not in the word list words.txt. How would you go about it?
If we consider each a set of words, then this becomes a set subtraction, i.e. all the words in set1 (emma) that are not in set2 (words).
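On a toy scale, the idea looks like the sketch below. The two small sets are made-up stand-ins for the contents of the files, and the names book, wordList and missing are assumptions made for this example.

```python
# Toy data standing in for the two files (not the real contents).
book = {'emma', 'woodhouse', 'handsome', 'clever'}
wordList = {'handsome', 'clever', 'rich'}

# Words in book that are missing from wordList.
missing = book - wordList
print(sorted(missing))  # ['emma', 'woodhouse']

# The difference() method is the same operation spelled out.
print(missing == book.difference(wordList))  # True
```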
Let's see the code for this then, first using dictionaries, then using sets.
import string

def dictSubtract(d1, d2):
    '''Return a dictionary whose keys are the words in d1 not in d2.'''
    d = {}
    for word in d1:
        if word not in d2:
            # Add the word, but don't care about the value.
            d[word] = None
    return d

def processSubractFile(filename):
    '''Return a histogram of the words in the filename.'''
    hist = {}
    with open(filename) as fin:
        for line in fin:
            processSubtractLine(line, hist)
    return hist

def processSubtractLine(line, hist):
    '''Split line into words, clean each word and count it in the histogram.'''
    # Treat hyphens as word separators.
    line = line.replace('-', ' ')
    for word in line.split():
        # Remove punctuation and whitespace from both ends of the word.
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()
        hist[word] = 1 + hist.get(word, 0)

# Use emma.txt and words.txt.
bookWords = processSubractFile('emma.txt')
listWords = processSubractFile('words.txt')
d = dictSubtract(bookWords, listWords)
l = list(d)
print(f'Found {len(d)} words. Here is a sample:')
for word in l[:10]:
    print(word)
Found 587 words. Here is a sample: emma woodhouse a sister's remembrance taylor mr woodhouse's taylor's emma's
import string

def processSetFile(filename):
    wordSet = set()
    with open(filename) as fin:
        for line in fin:
            processSetLine(line, wordSet)
    return wordSet

def processSetLine(line, wordSet):
    line = line.replace('-', ' ')
    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()
        wordSet.add(word)

# Use emma.txt and words.txt.
bookWords = processSetFile('emma.txt')
listWords = processSetFile('words.txt')
s = bookWords - listWords
print(f'Found {len(s)} words. Here is a sample:')
# Here's a small sample.
l = list(s)
for word in l[:10]:
    print(word)
Found 587 words. Here is a sample: humourist recollect tuesday complimenter unseasonableness hetty november improvidently smallridge
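If you already have word histograms stored as dictionaries, there is a middle ground between the two versions: dictionary key views support set operations directly. The sketch below uses two small made-up histograms; the names bookHist, listHist and missing are assumptions made for this example.

```python
# Toy histograms standing in for ones built from the two files.
bookHist = {'emma': 3, 'the': 10, 'woodhouse': 2}
listHist = {'the': 1, 'a': 1}

# Key views behave like sets, so subtraction yields a plain set of keys.
missing = bookHist.keys() - listHist.keys()
print(sorted(missing))  # ['emma', 'woodhouse']
```

This keeps the counts available in the histograms while avoiding a hand-written subtraction loop.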
import random

def chooseFromHistogram(h):
    l = []
    for word, frequency in h.items():
        l.extend([word] * frequency)
    return random.choice(l)

d = {}
h = {'a': 2, 'b': 1}
for _ in range(10000):
    letter = chooseFromHistogram(h)
    d[letter] = 1 + d.get(letter, 0)
print(d)
{'b': 3333, 'a': 6667}
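As an alternative to building the expanded list, the standard library function random.choices() accepts weights directly. The sketch below is one possible variant, not a replacement for the solution above; the function name chooseWeighted and the dictionary counts are assumptions made for this example.

```python
import random

def chooseWeighted(h):
    '''Return a key of h, chosen with probability proportional to its count.'''
    keys = list(h)
    weights = [h[k] for k in keys]
    # choices() returns a list of k picks; take the single element.
    return random.choices(keys, weights=weights, k=1)[0]

h = {'a': 2, 'b': 1}
counts = {}
for _ in range(10000):
    letter = chooseWeighted(h)
    counts[letter] = 1 + counts.get(letter, 0)
print(counts)
```

This avoids creating a list whose length is the sum of all frequencies, which matters when the histogram comes from a whole book.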
Create a separate Python source file (.py) in VSC to complete each exercise.
Write a program that:
Hint: The string module provides a string named whitespace, which contains space, tab, newline, etc., and punctuation, which contains the punctuation characters. These will be helpful in your solution.
Also, you might consider using the string methods strip(), replace() and translate().
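To see how the three string methods differ, here is a short sketch on made-up strings. The sample strings and the name removePunct are assumptions made for this example.

```python
import string

# strip() removes characters only from the two ends of a string.
print('"sister\'s,"'.strip(string.punctuation))  # sister's

# replace() swaps every occurrence of one substring for another.
print('well-known'.replace('-', ' '))  # well known

# translate() with a deletion table removes characters anywhere in the string.
removePunct = str.maketrans('', '', string.punctuation)
print("Hello, World! It's me.".translate(removePunct))  # Hello World Its me
```

Notice that strip() keeps the apostrophe inside sister's, while translate() removes the one inside It's. Which behavior you want depends on how you decide to treat such words.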
Compare your solution to those provided further down in this discussion.
A sample file to use, if you'd like, is convert.txt.
Go to Project Gutenberg and download your favorite out-of-copyright book in plain text format.
Modify your program from the previous exercise to:
Then modify the program to:
Print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary?
Here is a copy of Emma if you want to try it out. Note: I removed all leading/trailing header lines.
Modify the program from the previous exercise to print the 20 most frequently used words in the book.
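One way to approach the ranking step is to sort the histogram's items by count. The sketch below works on a small made-up histogram rather than a real book; the function name mostCommon and the sample data are assumptions made for this example.

```python
def mostCommon(hist, n):
    '''Return the n (word, count) pairs with the highest counts.'''
    # Sort by count, largest first; sorted() is stable, so ties keep
    # their original dictionary order.
    ranked = sorted(hist.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

hist = {'the': 5, 'emma': 3, 'a': 4, 'woodhouse': 1}
for word, count in mostCommon(hist, 2):
    print(word, count)
```

For the exercise you would pass the histogram of the whole book and n = 20.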