Introduction to Text Processing: Let's Make a Fast Dictionary in Python (Part 3)

Table of Contents
[Part 1]
- Importance of text processing (text processing)
- What is a dictionary lookup (what this article covers)
- Basic Concept of Trie Trees (TRIE)
- About double arrays
[No.2]
[No.3]
[No.4]
Python Implementation of Double Arrays (Part 1)
The above explanation is directly translated into a Python program.
(If you start reading from here and are concerned about the contents, please go back and look through it lightly.
I will write the policy and notes first.
The creation of a double array allows us to break down the entire process into
phases : transitions from one node to the next.
This is done for all phases.
Consider what kind of input is needed to process a single phase,
For any string that is a forward match for the headword (including the empty string and the headword itself), we first need
the set of the next character (including the end of the word).
The substring is represented by a character ID tuple, and the next character is represented by a character ID set, which is stored in dict.
In this article, we will call it a "branching database.
The other is the node position of the transition source in the phase.
In other words, it is the position of the node when it transitions from the beginning to the end of the substring.
This is also stored in
dict (defaultdict) with the tuple of the substring character ID as the key and the node position at the time of transition to that point as the value. In this article, we will call it the "transition source database.
The branch database should be created in advance.
The items in the transition source database are added while creating a double array.
When the calculation of the phase corresponding to an item is completed, it is no longer used, so it is deleted.
Advantages and Cautions"
Now, there are two advantages to a calculation method that processes each phase and repeats for all phases.
The first is that it does not require recursion.
Using recursion may seem easier to implement, but it is also more difficult when you think about
the runtime, such as taking care not to make the loop too deep.
On the other hand, if you try to write a loop without recursion, the algorithm generally becomes more complicated.
In this case, too, I am trying to write it in a loop, but since the process has already been broken down into phases,
it is easy to simply loop through as many phases as there are phases.
Another advantage is that the order of calculation, such as width-first or depth-first, becomes irrelevant to the program design.
You can change the order in which the input data (branch database items) are retrieved to support either method, or you can use
either or neither method. (In this case, depth is the priority.)
This is all well and good, but there is a caveat.
When calculating a transition, the calculation of the previous transition must have already been completed.
This is because the node position of the transition source in that phase must be in the transition source database.
We will keep this in mind when preparing the branch database.
Python Implementation of Double Arrays (Part 2)
Let's write the program. First, let's prepare the input data (list of headwords).
The six words "den" and "where"... familiar from the previous explanations can be used as headwords, but since we are going to
, we will use the title list of the Japanese version of Wikipedia, as written at the beginning of this article.
Almost every character is included in the title by itself, so use a heading of two or more characters.
At the time of writing this article (2018.10), there are about 1.8 million of them. That is a perfect amount.
We will re-save this as words.txt. The word ID of the headword should be a line number.
(If it were an actual dictionary, it would be the record number of the word sense data, etc.)
The process is posted here. I will provide a link to download the title so you can do it manually.
Now, let's write the functions needed to create the double array.
Function: get_branch_db
The first function is to prepare the branch database. The explanation is repetitive, but
"substring from the beginning" => "the next character" is registered in the dict.
The function also assigns an ID to each character. Also, make sure that the order described in the notes above is satisfied.
By the way, we would like to use defaultdict for this process and still maintain the key registration order.
This behavior is not guaranteed depending on the Python version or implementation.
We could create such a dict as a new class, but for now we'll go with a simple and naive solution:
create an array of keys separately from the dict.
Function: get_base_s
Create a function to find the base[s] of the transition from position s.
Create an array of offsets by subtracting the smallest character ID from each element of the array of character IDs of the character to be branched.
Find a position where each element of the offset array is as far to the left of the unused area as possible.
From there, we can work backward to obtain base[s].
Function: update_slot_start
Related to the above, when looking for an empty slot to move to, if you always search from the root,
it will become slower and slower as the number of nodes increases.
Since the leftmost position in a double array is (supposed to be) filled from the left, let's create a function that searches for the leftmost position and call it
as needed.
Function: make_darray
Function to create a double array (base, check). The input of the function is a branch database.
The algorithm is written in pseudo code-like sentences.
___ Empty transition source database (substring) => node position) is prepared.
____Retrieve pairs of "substring and character set of transition destination" from branch database (loop)
____ update the left end of the free position (update_slot_start)
______ first calculation (transition from root)
________ Transition source s = 1
Processing of ________ transitions (omitted, as the basics are the same as the contents that follow)
________Registers the location of the transition destination in the transition-source database
______ other than that:
________ s The value of s is obtained from the transition-source database.
________ when it is a leaf node and does not branch:
__________ base[s]. with a negative number of the headword ID with a negative number.
________ otherwise:
__________Take one character from the character set of the destination (loop)
____________ base[s]. find the base character set (loop) (get_base_s)
____________ destination t = base[s] + c
____________ when check[t] = s
____________ transition destination is a leaf node:
______________ base[t]. with a negative number of the headword ID with a negative number.
____________ other than that:
______________Register the location of the transition destination in the transition-source database
________ Delete the item from the transition-source database when the calculation is finished
Function: save_darray
Writes the double array that has been created to a file.
For now, we save 3 bytes per element of the array.
(Normally, this is not a good idea, so do the math!)
) Binaries are handled by convenience functions that do not require pack or unpack.
Also, the correspondence table between characters and character IDs is output to a text file in the form of a string.
Finally, the calls to the above processes are summarized in the main function.
Here is the code.
[Code Example 3 and Output Example 3
Have you successfully output the three files darray.{base,check,code}?
Approximately 7.7 million nodes have been created.
With a Core i7 CPU and 8 GB of memory, this should be a 1-2 minute process.
Now, let's finish the search using this double array.