Introduction to Text Processing: Build a Fast Dictionary in Python (Part 2)

Table of Contents
[Part 1]
- Importance of text processing (text processing)
- What is a dictionary lookup (what this article covers)
- Basic Concept of Trie Trees (TRIE)
- About double arrays
[No.2]
[No.3]
[No.4]
Creating a double array
How is a Trie tree realized in a double array?
First, a one-dimensional array of integers called "base" is prepared.
One element of this array represents one node.
The index of the base array starts at 1. (The 0 position is not used.)
That is, the node with the number 1 is the root. There is a 1 here.
base[1] = 1.
The first characters of the headword are "de" and "do" in the previous example.
So where is the next node that root transitions to by "de"?
To prepare, assign an identification number to each character of the headword.
You can use the character code of some encoding as it is, but since there are only a few
characters, we will start with 1 and assign a character ID appropriately here.
The order of appearance of words and letters is as follows.
and : 1 hmm... : 2 Do : 3 : 3 ko : 4 1 : 4 soil : 3 : 4 : 5 10 :5 : 6 : 6 b : 8 : 9 : 10 : 7 Ye : 8 : 8
The "de" is now 1 and the "do" is now 3. code['de']=1, code['do']=3.
By the way, as I will explain later, the character ID number 0 is the end of a word (transition to a leaf node).
Since we mentioned leaf nodes, we will also assign IDs to words in the order of appearance.
Den : : 1 :1 :1 where :2 don't :3 :3 :3 don :3 :4 bam bam bam bam bam bam :5 Donbe :5 :6
Now, in this way, the position of the node that transitions from root by the letter "de" takes the value of
root's base plus the character ID. In other words:
base[1] + code['de'] = 1 + 1 = 2
which becomes: base[1] + code['de'] = 1 + 1 = 2. Do the same for "de",
base[1] + code['do'] = 1 + 3 = 4
Now the base array is newly filled with numbers 2 and 4.
As you can see, the destination node position is the value of "base value + character ID".
It's kind of amazing. The character information has been absorbed into the "transition movement amount.
The problem here is that we can transition from
root to any node by adding the value of any appropriate character ID to the value of base[1].
This is not good, because I want the next characters to be only "de" and "do".
So, we prepare another one-dimensional array of integers called "check".
Then, in check, we put where the node came from (=index of the parent node).
If the parent node is root, it is 1. In addition, root's own check should be 0. In the formula:
check[1] = 0
check[2] = 1
check[4] = 1
The name "double array" comes from the fact that the Trie tree is represented by two arrays, base and check, as shown here:
It may be better to think of
as "a node in a double array has location information, a base component, and a check component.
Let's make a table to show what we have so far.
index | 1 | 2 | -base | 4 |
base | 1 | -base | -base | -base |
check | 0 | 1 | -base | 1 |
code | 1 | -base | 3 | |
Transition | root | and | -base | root |
Now nodes 1, 2, and 4 are filled, and 3 is still empty.
How do we determine the value of base for nodes 2 and 4?
When we transitioned from position s to position t by letter c, we had
t = base[s] + code[c]
check[t] = s
If the value of base[s] is determined as desired, when more and more nodes are added in the future, there will be positional batting with nodes that have existed since before
.
Conversely, we decide the value of
base[s] so that the next node will transition to a vacant position.
If we look at the headwords: "den," "doko," "don," "donchan," "donde," and "dongbe,"
For example, from the second character "n" of "don," there are four branches to
child nodes: "word end," "chi," "do," and "be," so we must make sure all of them do not bat against the others.
The BASE component can be called the offset value.
In general, the offset value is determined to be the smallest, i.e., the combination of transition destinations should be on the left side as much as possible.
As the double array grows, it is as if the empty spaces are being filled from the left side.
Now, for the time being, I would like to look at the growth of the double array, one step at a time.
Actually, there is only one more important concept left to explain, that of "leaf node processing," so please feel free to read on.
Even if you don't get it the first time, I am sure you will get it if you read it several times while taking notes.
By the way, this time, since we are registering words in order of their IDs, we are creating
depth-first.
Another way to do it is to register the transition from root to the first character for all headwords, then
register the transition from the first character to the second character for all headwords, and so on.
This would create a breadth-first transition.
Step 1: "de" -> "den
index | 1 | 2 | 3 | 4 |
base | 1 | 1 | -1 | -base |
check | 0 | 1 | 2 | 1 |
code | 1 | 2 | 3 | |
Transition | root | and | de→n | root |
Transition from source s = 2. The character ID is code['n'] = 2, so if
base[2] = 1, the destination is t = base[s] + code[c ] = 1 + 2 = 3, and the transition can be made to
where there is an empty space (left end). Put the parent node position in check there.
check[3] = s = 2
Now, we've reached the end of the word with "Den". Since there is no other headword that follows "Den...", there is no branch in the
transition (the leaf node has no sibling nodes).
Normally, a leaf node is created with code[leaf node] = 0, considering the transition destination.
If there is no branching, do not bother to create a leaf node, but put the sign of the
word ID in the base component as a negative number by inverting it.
This saves memory and slightly increases search speed.
The word ID of "den" is 1, so base[3] = -1
In the table above, the transition from "de" to "den" is simplified to "de→n".
Step 2: "Do" → "doko", "dondo
index | 1 | 2 | 3 | 4 | 5 | -base | 7 |
base | 1 | 1 | -1 | 3 | -base | -base | -2 |
check | 0 | 1 | 2 | 1 | 4 | -base | 4 |
code | 2 | 4 | |||||
Transition | root | and | de→n | root | Do→n | do→ko |
The transition is from transition source s = 4.
There are two transitions from "de" to "ko" and "n".
Since code['ko'] = 4 and code['n'] = 2, we can determine the node position of the transition destination well at
by setting base[4] = 3. Let's look at each of them.
Do" -> "where"
destination t = base[4] + code['ko'] = 3 + 4 = 7, hence check[7] = s = 4
"where" came at the end of the word, but there is no branch (there is no other headword that follows "where..."), so we can use
as before. As before, sign-reversing the 2 in the word ID, base[4] = -2
Don" -> "Don"
Transition destination t = base[4] + code['n'] = 3 + 2 = 5, therefore check[5] = s = 4
"Don" came at the end of the word, but from "Don" there are other branches such as "Chi", "Do", and "Be".
We cannot put the word ID (with the sign of the word reversed) in base[t], and we have to create a new leaf node.
That's all for now, and we will consider this in the next step.
Thus, in double arrays, even if you are creating them using the depth-first method, you must always consider the width (number of branches) of the
connection to the next node as you create it. This is a bit interesting.
Step 3: "Don" → "Don (end of word)", "Donchi", "Dondon", "Dombee"
index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | -base | 9 | -base | 11 | -base | 13 |
base | 1 | 1 | -1 | 3 | 6 | -3 | -2 | -base | -base | -base | -base | -base | -base |
check | 0 | 1 | 2 | 1 | 4 | 5 | 4 | -base | 5 | -base | 5 | -base | 5 |
code | 0 | 3 | 5 | 7 | |||||||||
Transition | root | and | Den : : 1 :1 | root | Dong→dong | Dong→dong | where | do→doo | thud | thump thump |
Transition from source s = 5.
Since code[end of word] = 0, code['ch'] = 5, code['do'] = 3, code['be'] = 7,
base[5] = 6, the position of the destination node is well determined.
Let's look at each of these.
Don' -> end of word
This is the process we put off earlier.
Destination t = base[5] + code[end of word] = 6, therefore check[6] = s = 5
Since it is the end of the word, we invert the sign 3 of the word ID of "don" and put it in.
base[6] = -3
If there is a branch, create a new leaf node like this
and put the word ID (with the sign reversed) there, which is different from the case of
where there is no branch.
Don" -> "Donchi"
Transition destination t = base[5] + code['chi'] = 6 + 5 = 11, hence check[11] = s = 5
Don" -> "Dondon"
Transition destination t = base[5] + code['do'] = 6 + 3 = 9, hence check[9] = s = 5
Don" -> "Donbe"
destination t = base[5] + code['be'] = 6 + 7 = 13, hence check[13] = s = 5
Step 4: "DONCHI" → "DONCHA" → "DONCHAN
index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | -base | 13 |
base | 1 | 1 | -1 | 3 | 6 | -3 | -2 | 8 | -base | -4 | 2 | -base | -base |
check | 0 | 1 | 2 | 1 | 4 | 5 | 4 | 11 | 5 | 8 | 5 | -base | 5 |
code | 6 | 2 | |||||||||||
Transition | root | and | Den : : 1 :1 | root | Dong→dong | Dong→dong | where | Dong-cha | do→doo | Dong-cha | thud | thump thump |
So far we have progressed one character at a time, but there is no branch for "Donchi" -> "Donchan"
I will summarize the explanation for the two characters.
DONCHI -> DONCHA
source of transition s = 11. code['ya'] = 6, so if base[11] = 2, destination t = base[11] + code['ya'] = 8
therefore check[8] = 11
Don-chan" -> "Don-chan"
Transition source s = 8. code['n'] = 2, so if base[8] = 8, destination t = base[8] + code['n'] = 10
Therefore check[10] = 8
We have reached the end of the word, but there is no branch for the transition to the leaf node either. The word ID is 4, so
reverse the sign and base[10] = -4
Step 5: "Dondon" → "Dongdon", "Dongbe" → "Dongbe"
index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
base | 1 | 1 | -1 | 3 | 6 | -3 | -2 | 8 | 10 | -4 | 2 | -5 | 6 | -6 |
check | 0 | 1 | 2 | 1 | 4 | 5 | 4 | 11 | 5 | 8 | 5 | 9 | 5 | 13 |
code | 2 | 8 | ||||||||||||
Transition | root | and | Den : : 1 :1 | root | Dong→dong | Dong→dong | where | Dong-cha | do→doo | Dong-cha | thud | Dobon-dobon | thump thump | Donbe |
The double array is complete. (Hey, explanation...?) Thank you for your time.
Finally, I would like to summarize some important points regarding the creation of double arrays.
Prepare two arrays, base and check.
The root position is 1.
base[1] = 1, check[1] = 0
The relational expression for the transition from position s to position t by letter c is:
t = base[s] + code[c]
check[t] = s
First, base[s] must be determined. There may be one or more c characters.
The destination should be decided so that the unused area of the double array is filled to the left as much as possible, taking the branching into consideration.
When transitioning to a leaf node (when coming to the end of a headword), if the word ID is wid,
t = base[s]
check[t] = s
base[t] = -wid
However, if there is no branching in the transition (no longer word with forward matching), as a special case, do not create a leaf node at position t, but create one at position s and do the following
base[s] = -wid