Introduction to Text Processing: Let's make a fast dictionary in Python (Part 4)

Python Implementation of a Double Array (Part 3)

Up to this point, we have mainly discussed how double arrays are created.
If you have read this far, you probably have a rough idea of how searching them works.
If the creation story still feels difficult, try reading "Creating Double Arrays" again and filling in the double-array table by hand.

Here I will write about incremental search.
Among the six words used for illustration, while searching for "donchan", a leaf node is already found at the second character from the beginning, "don".
In an associative array such as Python's dict, "donchan" and "don" are separate lookups, but in a Trie built on a double array, you pick up every headword that is a prefix of the search key in a single forward pass.
This is the greatest strength of the Trie.

Consider a task where you have a sentence and want to find every headword that appears in it.
If I implemented this kind of dictionary search with a dict, I would have to generate substrings of every length starting at every character position in the sentence and query the dict for each one.
The longer the sentence, the heavier this becomes.
What about searching with the Trie instead?
You only need to give the position of each character in the sentence; there is no need to build search keys of various lengths starting from that position.
Also, as soon as no further transition is possible in the Trie, the search exits immediately, so you never waste lookups on long, useless keys the way you do with a dict.
And by the time the search exits, you have already collected every headword in the dictionary that matches a prefix of the substring up to that point.
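For concreteness, the dict-based approach described above could be sketched like this (the function name and the tiny word dict are made up for illustration):

```python
def dict_match_all(sentence, word_dict):
    # Naive dict lookup: try every substring of every length starting
    # at every character position -- O(n^2) lookups for a length-n sentence.
    hits = []
    n = len(sentence)
    for start in range(n):
        for end in range(start + 1, n + 1):
            sub = sentence[start:end]
            if sub in word_dict:
                hits.append((start, end, word_dict[sub]))
    return hits

print(dict_match_all("abcab", {"ab": 1, "abc": 2}))
# [(0, 2, 1), (0, 3, 2), (3, 5, 1)]
```

Every position spawns its own family of lookups, which is exactly the waste the Trie search avoids.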

Now let me write a program that uses the double array, independent of the program that created it.
jupyter-notebook makes this easy: all I have to do is continue in the cell below.

Many quantities (variables) appeared in the program that created the double array.
When using one, however, all you need are the values of base and check at a given node position.
You do not even need to materialize the base and check arrays to get those values: just load the files and access the binary data directly.
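The article does not reproduce its file format here, so purely as an illustration: assuming each array is stored as 4-byte little-endian signed integers, a single value can be read from the file without loading the whole array.

```python
import io
import struct

def read_int32(f, index):
    # Read the 4-byte signed little-endian integer at array position
    # `index` of an open binary file, without loading the whole array.
    f.seek(index * 4)
    return struct.unpack("<i", f.read(4))[0]

# demo with an in-memory "file" holding the array [0, 1, -2, 3]
buf = io.BytesIO(struct.pack("<4i", 0, 1, -2, 3))
print(read_int32(buf, 2))  # -2
```

The same call works on a real file object opened with `open(path, "rb")`.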

In the user-side program, we create a class, class DArray, which has four methods.

Method: __init__
Reads the three dictionary files and keeps them as members.
The character => character-ID correspondence is held in a dict named c2code.

Method: base
A getter-like method that returns the base component for a given node position.

Method: check
A getter-like method that returns the check component for a given node position.

Method: search_gen
Searching a double array follows exactly the same node transitions as creating one.
The differences are that no new nodes are created, and the search ends as soon as no transition is possible.
In that sense, it is a simpler algorithm than building the double array.
This method is written as a generator because that is the easiest form; callers can convert the result to a list or iterate over it in the "in" part of a for statement.

The inputs to the method are a search key (for example, a whole sentence) on which to perform the incremental search, and the list of character IDs. In pseudocode:

    s = 1
    for each character c of the search key, from the left (its position is i):
        if the ID of character c is not in the character-ID list:
            return                                  # end of search
        # the ID of character c is in the list
        get base[s]
        t = base[s] + c2code[c]
        get check[t]
        if check[t] == s:                           # consistency check OK
            if base[t] < 0:                         # leaf node 1
                yield (i, -base[t])                 # a registered word was found!
            else:
                t2 = base[t]
                if base[t2] < 0 and check[t2] == t: # leaf node 2
                    yield (i, -base[t2])            # a registered word was found!
            s = t                                   # move to the next node position
            continue
        if check[t] != s:                           # consistency check NG
            return                                  # end of search

The program for the DArray class looks like this. It is quite short.

[Code Example 4]
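Since the code example itself is not reproduced here, a minimal sketch of such a class might look like the following. Note the assumptions: the article's __init__ loads base, check, and c2code from three files, whereas this sketch takes them as arguments to stay self-contained, and the tiny hand-built arrays encode just the two words "ab" and "abc".

```python
class DArray:
    """Sketch of a double-array dictionary with an incremental search."""

    def __init__(self, base_arr, check_arr, c2code):
        # the article's version reads these from the three dictionary files
        self._base = base_arr
        self._check = check_arr
        self.c2code = c2code  # character => character-ID dict

    def base(self, s):
        # getter-like: base component at node position s
        return self._base[s]

    def check(self, t):
        # getter-like: check component at node position t
        return self._check[t]

    def search_gen(self, key):
        # Incremental search: yield (position, word_id) for every
        # registered word matching a prefix of `key`.
        s = 1  # root node position
        for i, c in enumerate(key):
            if c not in self.c2code:
                return                                  # end of search
            t = self.base(s) + self.c2code[c]
            if t < 1 or t >= len(self._check) or self.check(t) != s:
                return                                  # consistency check NG
            if self.base(t) < 0:                        # leaf node 1
                yield i, -self.base(t)                  # a registered word!
            else:
                t2 = self.base(t)                       # terminator child position
                if self.base(t2) < 0 and self.check(t2) == t:  # leaf node 2
                    yield i, -self.base(t2)             # a registered word!
            s = t                                       # move to the next node

# tiny hand-built double array for the words "ab" (ID 1) and "abc" (ID 2)
da = DArray([0, 1, 1, 4, -1, 0, 0, -2],
            [0, 0, 1, 2, 3, 0, 0, 3],
            {"a": 1, "b": 2, "c": 3})
print(list(da.search_gen("abcd")))  # [(1, 1), (2, 2)]
```

Both registered words fall out of a single left-to-right pass, and the unknown character "d" simply ends the search.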

Python Implementation of a Double Array (Part 4)

Now, let's actually use the DArray class above.
I would like to see how many Wikipedia headword strings can be found in one long Wikipedia article.
Wikipedia prohibits scraping, so just open one of the articles from the list below in a browser, copy and paste it, and save it as a text file.

Long pages (Wikipedia)
https://ja.wikipedia.org/wiki/Special:LongPages

I chose "History of the Separation of Church and State in Europe" and saved it as church.txt.
It is quite long for a single article, a little under 500 KB.

Searching the article is a three-step process:
(1) read a line from the file,
(2) create substrings of that line starting at each character position,
(3) perform an incremental search on each substring.

Since the search method was written as a generator, let's make each of these steps a generator as well.
A multistage generator simply passes its results downstream one after another.
Nothing is actually computed at the point of passing (lazy evaluation).

Suppose an upstream generator packs two variables together and sends them downstream as a tuple.
How can the downstream generator use part of it?
If we naively wrote, in the downstream generator,

    (variable1, variable2) = result of the upstream generator

then Python would try to unpack everything the upstream generator produces into the two-element tuple on the left.
(Terrifying, isn't it!)
Since the generator yields many tuples, not two values, this fails with an element-count error.

The way to make it work is to put a for loop in the downstream generator and unpack there:

    for (variable1, variable2) in result of the upstream generator:
        processing that uses variable1 and variable2
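A minimal illustration of the difference (all names here are made up):

```python
def pairs():
    # upstream generator that yields 2-element tuples
    for i in range(3):
        yield i, i * i

# Unpacking the generator directly into a 2-element tuple fails:
#     (a, b) = pairs()   # ValueError: too many values to unpack

# Unpacking inside a for loop takes one tuple at a time:
result = []
for a, b in pairs():
    result.append((a, b))
print(result)  # [(0, 0), (1, 1), (2, 4)]
```

The for loop pulls one yielded tuple per iteration, so the two-element unpacking always matches.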

It may be quicker to just look at the program.

[Code Example 5] and [Result Example 5]
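The code and result examples themselves are not reproduced here, so as a sketch only: the three steps could be chained like this, where every name is hypothetical and a toy search function stands in for DArray.search_gen (it yields (last-character-index, word_id) pairs in the same shape).

```python
def read_lines(path):
    # step (1): yield each line of the saved article file
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

def substrings(lines):
    # step (2): yield (start, suffix) for every character position
    for line in lines:
        for i in range(len(line)):
            yield i, line[i:]

def matches(subs, search_gen):
    # step (3): incremental search on each suffix; the upstream tuples
    # are unpacked inside the for loop, not by direct assignment
    for start, sub in subs:
        for pos, word_id in search_gen(sub):
            yield start, start + pos, word_id  # inclusive end index

def toy_search(key):
    # stand-in for DArray.search_gen over the two words "ab" and "abc"
    for w, wid in {"ab": 1, "abc": 2}.items():
        if key.startswith(w):
            yield len(w) - 1, wid

# feeding a list of lines directly instead of read_lines(path)
print(list(matches(substrings(["abcab"]), toy_search)))
# [(0, 1, 1), (0, 2, 2), (3, 4, 1)]
```

Swapping the list for `read_lines("church.txt")` and the toy search for the real one gives the pipeline described above; until `list()` pulls on the final generator, no line is read and no search runs.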

There were approximately 85,000 string matches between the dictionary and the article.
Memory usage was about 64 MB, and the run took about 2 seconds on a Core i7.
Doing the same thing with dict, an associative array, took about 20 seconds.
As a handicap, the dict version's time excludes the time needed to build the dict from words.txt.
The Trie is a fast algorithm to begin with, but the double array is especially fast.
(↑ This article is summarized in that one line.)

A few final observations

Let me consider what could be improved in this program.
One might say, "If it is going to be integrated into a system, doesn't it need more error checking?", but I will set that aside for now.

I will also leave out "rewrite it in C++" and "work in UTF-8 bytes instead of characters, using the byte values directly as character codes".
Neither is difficult, so I encourage anyone interested to try them.

"Make it a Patricia tree"
The Patricia tree (radix trie / compact prefix tree) is one of the refinements of the Trie.
In this article's description and implementation, we compressed only the leaf nodes, where the transitions do not branch.

In a Patricia tree, all unbranching nodes on a straight path are compressed, not just the leaves.
In the six-word world used for illustration, for example, the "de→n→end" transition has no branches and forms a single path, so it can be combined into one node.
That would bring the number of nodes needed down to nine. Please count them.

Patricia trees have long been a very popular way of compressing Tries, and they are probably the most common approach for double arrays as well.
However, something changes significantly when a double array is made into a Patricia tree: as a result of compressing the single paths, every node comes to have multiple transition destinations.
This makes it difficult to fill the free space in the double array neatly from left to right.

When creating a double array, the phases cannot be processed in parallel; they must always be processed serially, one step at a time.
If this turned into a packing problem, it would take far too much time.
On the other hand, if you searched for free space from the leftmost position every time you created a node, the process would gradually become slower and slower.
A realistic approach therefore seems to be: once the free space in an area is filled to a certain extent, stop trying so hard there; even if not every slot in the area is used, at some timing or under some condition, shift the leftmost position at which the search for candidate transition destinations begins to the right.

In fact, while preparing the program for this article, I examined what happens to the free space as the double array grows.
It turned out that, because a large number of unbranched transitions occur during processing, those nodes steadily fill the free space from the left.
As a result, the creation of the double array is smooth and fast.
Depending on the task (or the input data), for example when the majority of transitions branch, the way this article updates the free positions could become a problem.

For these reasons, this article did not deal with Patricia trees.
If I have the opportunity, I would like to cover them another time.

This concludes my attempt at creating a dictionary-lookup program in Python. Thank you for reading.

Please feel free to adapt and use the programs in this article as you like.
However, I take no responsibility for any disadvantage caused by doing so.
