# LCP array

In computer science, the longest common prefix array (LCP array) is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes between pairs of consecutive suffixes in the sorted suffix array.

Augmenting the suffix array with the LCP array allows to efficiently simulate top-down and bottom-up traversals of the suffix tree,Template:SfnTemplate:Sfn speeds up pattern matching on the suffix arrayTemplate:Sfn and is a prerequisite for compressed suffix trees.Template:Sfn

## History

The LCP array was introduced by Udi Manber and Gene Myers alongside the suffix array in order to improve the running time of their string search algorithm.Template:Sfn

## Example

 Template:Left header | i Template:Left header | S[i] 1 2 3 4 5 6 7 b a n a n a $ Template:Left header | i Template:Left header | A[i] 1 2 3 4 5 6 7 7 6 4 2 1 5 3 Complete suffix array with suffixes itself : Template:Left header | i Template:Left header | A[i] Template:Left header | 1 Template:Left header | 2 Template:Left header | 3 Template:Left header | 4 Template:Left header | 5 1 2 3 4 5 6 7 7 6 4 2 1 5 3$ a a a b n n $n n a a a a a n$ n $n a a a n$ $a$

Then the LCP array $H$ is constructed by comparing lexicographically consecutive suffixes to determine their longest common prefix:

## Efficient Construction Algorithms

LCP array construction algorithms can be divided into two different categories: algorithms that compute the LCP array as a byproduct to the suffix array and algorithms that use an already constructed suffix array in order to compute the LCP values.

Template:Harvtxt provide an algorithm to compute the LCP array alongside the suffix array in $O(n\log n)$ time. Template:Harvtxt show that it is also possible to modify their $O(n)$ time algorithm such that it computes the LCP array as well. Template:Harvtxt present the first $O(n)$ time algorithm (FLAAP) that computes the LCP array given the text and the suffix array.

Assuming that each text symbol takes one byte and each entry of the suffix or LCP array takes 4 bytes, the major drawback of their algorithm is a large space occupancy of $13n$ bytes, while the original output (text, suffix array, LCP array) only occupies $9n$ bytes. Therefore Template:Harvtxt created a refined version of the algorithm of Template:Harvtxt (lcp9) and reduced the space occupancy to $9n$ bytes. Template:Harvtxt provide another refinement of Kasai's algorithm ($\Phi$ -algorithm) that improves the running time. Rather than the actual LCP array, this algorithm builds the permuted LCP (PLCP) array, in which the values appear in text order rather than lexicographical order.

Template:Harvtxt provide two algorithms that although being theoretically slow ($O(n^{2})$ ) were faster than the above mentioned algorithms in practice.

As of 2012, the currently fastest linear-time LCP array construction algorithm is due to Template:Harvtxt, which in turn is based on one of the fastest suffix array construction algorithms by Template:Harvtxt.

## Applications

As noted by Template:Harvtxt several string processing problems can be solved by the following kinds of tree traversals:

• bottom-up traversal of the complete suffix tree
• top-down traversal of a subtree of the suffix tree
• suffix tree traversal using the suffix links.

Template:Harvtxt show how to simulate a bottom-up traversal of the suffix tree using only the suffix array and LCP array. Template:Harvtxt enhance the suffix array with the LCP array and additional data structures and describe how this enhanced suffix array can be used to simulate all three kinds of suffix tree traversals. Template:Harvtxt reduce the space requirements of the enhanced suffix array by preprocessing the LCP array for range minimum queries. Thus, every problem that can be solved by suffix tree algorithms can also be solved using the enhanced suffix array.Template:Sfn

Deciding if a pattern $P$ of length $m$ is a substring of a string $S$ of length $n$ takes $O(m\log n)$ time if only the suffix array is used. By additionally using the LCP information, this bound can be improved to $O(m+\log n)$ time.Template:Sfn Template:Harvtxt show how to improve this running time even further to achieve optimal $O(m)$ time. Thus, using suffix array and LCP array information, the decision query can be answered as fast as using the suffix tree.

The LCP array is also an essential part of compressed suffix trees which provide full suffix tree functionality like suffix links and lowest common ancestor queries.Template:SfnTemplate:Sfn Furthermore it can be used together with the suffix array to compute the Lempel-Ziv LZ77 factorization in $O(n)$ time. Template:SfnTemplate:SfnTemplate:SfnTemplate:Sfn

The remainder of this section explains two applications of the LCP array in more detail: How the suffix array and the LCP array of a string can be used to construct the corresponding suffix tree and how it is possible to answer LCP queries for arbitrary suffixes using range minimum queries on the LCP array.

### Suffix Tree Construction

Given the suffix array $A$ and the LCP array $H$ of a string $S=s_{1},s_{2},...s_{n}\$ of length $n+1$ , its suffix tree $ST$ can be constructed in $O(n)$ time based on the following idea: Start with the partial suffix tree for the lexicographically smallest suffix and repeatedly insert the other suffixes in the order given by the suffix array.

We need to distinguish two cases:

A simple amortization argument shows that the running time of this algorithm is bounded by $O(n)$ :

The nodes that are traversed in step $i$ by walking up the rightmost path of $ST_{i}$ (apart from the last node $v$ ) are removed from the rightmost path, when $A[i+1]$ is added to the tree as a new leaf. These nodes will never be traversed again for all subsequent steps $j>i$ . Therefore, at most $2n$ nodes will be traversed in total.

### LCP queries for arbitrary suffixes

Because of the lexicographic order of the suffix array, every common prefix of the suffixes $S[i,n]$ and $S[j,n]$ has to be a common prefix of all suffixes between $i$ 's position in the suffix array $A^{-1}[i]$ and $j$ 's position in the suffix array $A^{-1}[j]$ . Therefore the length of the longest prefix that is shared by all of these suffixes is the minimum value in the interval $H[A^{-1}[i]+1,A^{-1}[j]]$ . This value can be found in constant time if $H$ is preprocessed for range minimum queries.