Data Structures for Search

In the previous reading, we introduced breadth-first search (BFS); when searching for a path between two nodes, BFS (in contrast to DFS) finds the shortest one, if one exists at all.

A key detail of the BFS strategy is a TODO list of work to be done (that is, nodes to be visited). A loop keeps popping nodes from the front of the list (and visiting them), then adding newly discovered nodes to the end of the list. The search is done when the list is empty or a path is found. Our prior use of Python lists for the TODO list was inefficient; in this reading, we'll think more broadly about TODO lists and look at efficient alternatives.

A data structure where we only append items to the end and only remove items from the front is called a queue, or sometimes a FIFO (First-In First-Out) queue. There are three patterns with special names that are worth remembering:

queue: append to the end, pop from the front (FIFO, First-In First-Out)

stack: append to the end, pop from the end (LIFO, Last-In First-Out)

priority queue: pop the highest-priority item (for us, the smallest), regardless of insertion order

Fun fact: if the queue in the BFS algorithm were replaced with a stack, it wouldn't be BFS anymore; perhaps surprisingly, the resulting algorithm would be a non-recursive DFS.

It's often a bad idea, but Python lists can be used with all three patterns:
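For example, here is a quick sketch of each pattern on a plain list:

    # stack pattern: append to the end, pop from the end
    L = []
    L.append("a")
    L.append("b")
    print(L.pop())      # 'b' (last in, first out)

    # queue pattern: append to the end, pop from the front
    L = ["a", "b"]
    L.append("c")
    print(L.pop(0))     # 'a' (first in, first out)

    # priority queue pattern: sort, then pop the smallest from the front
    L = [3, 1, 2]
    L.sort()
    print(L.pop(0))     # 1 (smallest item)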

Given your knowledge of what list operations are fast (constant time) and slow (for example, linear time or worse), which of the above patterns will be efficient?

Only the stack pattern is efficient, as making changes (pops/appends) at the end of a list is fast. The queue pattern is slow: popping from the front is an O(N) operation. The priority queue pattern is even worse: sorting is generally an O(N log N) operation.

We'll now discuss more efficient options for queues and priority queues.

Queue Pattern and deque

The Python deque (double-ended queue) data structure is efficient when used as either a stack or a queue. You can easily convert a list to a deque, and you can append to a deque as you would to a list. One minor difference: instead of L.pop(0), we'll use dq.popleft().
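For example, a quick sketch of converting a list to a deque and popping from the front:

    from collections import deque

    dq = deque(["a", "b"])   # convert a list to a deque
    dq.append("c")           # append works just like a list
    print(dq.popleft())      # 'a' -- like L.pop(0), but O(1) instead of O(N)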

We can use the deque type instead of lists to make the TODO queue for our BFS implementation more efficient. The following shows our BFS code from last reading, with just three changes (indicated in the comments) necessary to switch from using a list to using a deque:
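Something along these lines (a minimal sketch, assuming the graph is a dict mapping each node to a list of its neighbors; the three deque-related changes are marked in the comments):

    from collections import deque                  # CHANGE 1: import deque

    def bfs_path(graph, start, end):
        # sketch: graph is assumed to be a dict mapping each node to a
        # list of neighboring nodes; returns a tuple of nodes, or None
        todo = deque([(start,)])                   # CHANGE 2: deque instead of list
        visited = {start}
        while todo:
            path = todo.popleft()                  # CHANGE 3: popleft() instead of pop(0)
            node = path[-1]
            if node == end:
                return path
            for neighbor in graph.get(node, []):
                if neighbor not in visited:
                    visited.add(neighbor)
                    todo.append(path + (neighbor,))
        return None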

Same result, but it will be faster for large graphs.

Priority Queue Pattern and heapq

A ("A-start") is an important AI search algorithm: https://en.wikipedia.org/wiki/A_search_algorithm that we won't learn in CS 320 (take CS 540 if you're interested).

Instead of using a queue to decide the order in which to visit nodes, A* uses a priority queue.

To see why we would want this, let's look at a motivating case. Consider this graph:
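As a rough sketch, the graph can be written as an adjacency dict (the edges are inferred from the three paths listed below):

    # a rough sketch of the graph as an adjacency dict; the edges are
    # inferred from the three paths from A to E listed below
    graph = {
        "A": ["B", "E", "D"],
        "B": ["C"],
        "C": ["E"],
        "D": ["E"],
        "E": [],
    }

    print(bfs_path(graph, "A", "E"))   # ('A', 'E') -- the fewest-edge path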

There are three paths from A to E:

  1. A, B, C, E
  2. A, E
  3. A, D, E

Which path will DFS choose? Which will BFS choose?

ANSWER: DFS will choose (1); BFS will choose (2).

Now what if we're trying to navigate between two points, and some roads/edges are longer than others:
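For illustration, here is one possible weighting of the same graph (the weights below are made up, chosen so that the direct A-to-E edge is the longest option and A, D, E is the shortest):

    # hypothetical edge weights for illustration: each neighbor is paired
    # with the length of the edge leading to it
    weighted_graph = {
        "A": [("B", 2), ("E", 10), ("D", 1)],
        "B": [("C", 2)],
        "C": [("E", 2)],
        "D": [("E", 1)],
        "E": [],
    }
    # path lengths:  A,B,C,E = 6    A,E = 10    A,D,E = 2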

Which path do you prefer? Unless you like the scenic route, presumably A,D,E -- the one that neither BFS nor DFS chooses.

We need to base the order in which we explore nodes on distance. Rather than popping from the front or end of todo, we want to pop the closest node first. A priority queue is a structure that lets us do this.

Python's heapq functions can accomplish this: https://docs.python.org/3/library/heapq.html. heapq doesn't need a special data structure of its own; it cleverly arranges items within an ordinary list so that the smallest item is always quick to find.

Let's try it:
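A minimal sketch:

    import heapq

    # heappush keeps the underlying list arranged as a heap
    nums = []
    for x in [3, 1, 4, 2]:
        heapq.heappush(nums, x)

    print(heapq.heappop(nums))   # 1
    print(heapq.heappop(nums))   # 2
    print(heapq.heappop(nums))   # 3
    print(heapq.heappop(nums))   # 4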

See how the smallest item remaining is always what you get with heappop?

You could logically accomplish the same by sorting the list each time before you want to do .pop(0), but that would be WAY slower.

With the heapq implementation of priority queue, adding and removing items has O(log N) complexity (https://en.wikipedia.org/wiki/Binary_heap#Summary_of_running_times). In contrast, sorting before each pop would have complexity O(N log N). Ouch!
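To connect this back to the example, the TODO entries can be (distance, path) tuples; heapq compares tuples element by element, so the entry with the smallest distance pops first. A sketch, using the made-up weights from above:

    import heapq

    # sketch: entries are (total distance, path) tuples, so heappop
    # always returns the closest option discovered so far
    todo = []
    heapq.heappush(todo, (6, ("A", "B", "C", "E")))
    heapq.heappush(todo, (10, ("A", "E")))
    heapq.heappush(todo, (2, ("A", "D", "E")))

    print(heapq.heappop(todo))   # (2, ('A', 'D', 'E')) -- the shortest path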

Conclusion

There are many techniques for searching through a graph for a value or a path; they vary in terms of which nodes are explored first and the extra data structures needed to keep track of work to be done. The simplest is probably recursive DFS (depth-first search); that one is also very memory efficient for graphs that branch out quickly but are not very deep.

Unfortunately, DFS isn't good at finding short paths. BFS (breadth-first search) is guaranteed to find the shortest path (at least when every edge counts the same). On the other hand, it tends to be harder to implement, and the data structure used to keep track of work to be done can require a lot of memory.

We also learned about three patterns for keeping track of nodes to explore, and an efficient data structure for each pattern:

stack: a plain Python list (append and pop at the end)

queue: collections.deque (append and popleft)

priority queue: a list managed with the heapq functions (heappush and heappop)