Accelerating Sequence Alignments Based on FM-Index Using the Intel KNL Processor

FM-index is a compact data structure suitable for fast matches of short reads to large reference genomes. The matching algorithm using this index exhibits irregular memory access patterns that cause frequent cache misses, resulting in a memory bound problem. This paper analyzes different FM-index versions presented in the literature, focusing on those computing aspects related to the data access. As a result of the analysis, we propose a new organization of FM-index that minimizes the demand for memory bandwidth, allowing a great improvement of performance on processors with high-bandwidth memory, such as the second-generation Intel Xeon Phi (Knights Landing, or KNL), integrating ultra high-bandwidth stacked memory technology. As the roofline model shows, our implementation reaches 95 percent of the peak random access bandwidth limit when executed on the KNL and almost all of the available bandwidth when executed on other Intel Xeon architectures with conventional DDR memory. In addition, the obtained throughput in KNL is much higher than the results reported for GPUs in the literature.


INTRODUCTION
T HE high demand for fast and low-cost genomic sequencing has pushed onward the rapid development of next-generation sequencing (NGS) technologies. As a result, a number of high-throughput sequencing systems have appeared in industry, including those from Illumina, Roche 454, Life Technologies and Pacific Biosciences. These systems are able to produce huge amounts of short reads (in the order of giga base-pairs) per day of operation. For instance, the Illumina NovaSeq 6000 sequencing system is able to produce up to 20 billion reads of 150 base pairs (bp) in less than two days. This represents up to 6 terabits of data which have to be processed as fast as possible.
Usually the first step in NGS corresponds to sequence alignment, where sequence reads must be aligned or compared to a genomic reference to identify regions of similarity [1]. Most popular alignment methods are based on two types of index structures: suffix trees and variants, and hash tables.
When dealing with large reference genomes, great efforts were devoted to reduce memory requirements for sequence alignment. As a result, a set of alignment algorithms based on the FM-index structure have been developed [2]. The FMindex uses Burrows-Wheeler transform (BWT), a method for rearranging a character string that is useful for data compression [3]. FM-index is well suited for fast exact matches of short reads to large reference genomes while keeping a small memory footprint. Many efficient sequence aligners are based on FM-index, such as Bowtie [4], BWA [5] (BWA-SW [6] for long reads), SOAP2 [7] and BWT-SW [8]. As high-throughput sequencing systems produce a massive amount of data, the usage of high-performance technologies is of crucial importance to deal with the computational challenge. In fact, many optimized exact matching algorithms have appeared recently in the literature for different architectures, like CPU clusters, GPUs and FPGAs [9].
Due to the data structure layout, the searching process using FM-index exhibits irregular memory access patterns. In addition, sequence aligners based on that index include support for inexact matching built on exact alignments, that causes the memory pattern to be even less predictable. These data access patterns cause a high cache miss rate on typical cache hierarchies of multicore processors. Besides, it is common for the exact matching algorithm to be memory bound due to the low arithmetic intensity (ratio of the computation to the memory traffic). Each step of the algorithm accesses a section of the index that it is not known in advance, making the cache hierarchy difficult to exploit. This paper analyzes the exact matching based on the FMindex data structure. The study focuses on bottlenecks related to data access patterns and computational capabilities. The evaluation assumes a batch or offline setting, where a bulk of queries is issued to be processed as fast as possible. Our main contributions can be summarized as follows: • Exact matching algorithms built upon FM-index are analyzed, focusing on those aspects related to computing costs and access to data, which have a great impact on performance.
• A new organization of the FM-index data structure layout and codification is proposed, which reduces the required traffic between memory and processor cores for the exact search process.
• An optimized exact matching algorithm has been implemented based on the proposed FM-index, exploiting the ultra high-bandwidth memory modules integrated in the KNL processor.
We have evaluated our proposal on a system with an Intel Xeon Phi 7210 processor [10], [11] (codenamed Knights Landing, or KNL) that includes 64 cores and 16 GiB of stacked 3D MCDRAM integrated on package. This memory offers a much higher memory bandwidth than the standard out-of-package DDR4 DRAM (∼400 vs. ∼ 90 GB/s peak). Our results show that performance on KNL reaches 95% of the peak random access bandwidth limit, outperforming other CPU and GPU solutions reported in the literature.

FM-index
The FM-index is a data structure that allows fast substring searches over large texts [2]. In the following subsections we show some basic definitions concerning the FM-index.
In the rest of the paper, brackets are used to specify an entry in an array (e.g., A[k] is k-th entry of the array A), and a character or a substring in a string (e.g., A[k] is the character at position k in the string A, and A[k..r] is the substring from position k to position r in A).

Suffix Array
Let T [1..n] be a character string drawn from an alphabet Σ having σ symbols. The suffix array SA[1, ..., n] of T is an array containing the starting positions of all suffixes of T in lexicographical order [12]. The suffix array can be used as an index to locate every occurrence of a pattern.

Burrows-Wheeler Transform (BWT)
The BWT is a permutation of a character string. Originally it was used in text compression algorithms, but it has other applications such as large text indexing.
.n+1] of a n-character string T is another string obtained as follows: (1) append to the end of the original text T the symbol $, which is lexicographically smaller than any symbol in Σ; (2) form a conceptual (n+1)×(n+1) matrix M whose rows are the cyclic shifts of T $ sorted in lexicographical order; (3) the last column of the matrix M is the BWT of T , denoted by L.
The suffix array of T $ contains the starting positions in T of every row of the matrix M .

FM-index data structure
The FM-index is composed of two data structures derived from L (BWT of T ): the C array and the Occ matrix. The C array stores in C[c] the number of occurrences in L of the symbols lexicographical smaller than c. Occ[c, i] contains the number of occurrences of the symbol c in the prefix L[1.
The FM-index was designed as a compressed structure such that the index size can be smaller than the original text. However, in the context of sequence alignment, it is usually not compressed in order to achieve better performance [1].

Exact Matching Using FM-Index
Given a pattern Q[1..p], the FM-index allows to find all occurrences of Q in the text T [2]. The search process takes two steps: Count and Locate. The first step is a rank query process that calculates the number of occurrences of Q in T by identifying the first and last rows of matrix M (see sec. 2.1.2) prefixed by the query Q. The second step uses the indexes of these rows to access the suffix array, where it finds the position of every occurrence of Q in the text T .
The suffix array is usually a very large compressed data structure. However, its size for the human genome is about 12 GB (3 gigabases x 4 bytes), so it can be stored without compression in modern systems. This way, the Locate step is very simple, it only requires an access to the suffix array. For this reason, this paper focuses in the Count step.

Rank Query Implementations
To speed-up the Count process, FM-index uses the Occ matrix as a look-up table [2]. The main drawback of this solution is the large memory footprint of Occ. It is a matrix with σ rows and n+1 columns. Hence, its footprint is: where the size of Occ entry depends on n. For the human genome (DNA), σ is 4 (A, C, G, and T) and n is around 3G (3 gigabases). Hence, each Occ entry fits in a 4-byte unsigned integer, and the footprint of Occ is about 4×3G×4 ≈ 48 GB. Several techniques have been developed to reduce this large footprint based on storing only a portion of Occ and calculating the rest of it [2], [13], [14]. These techniques are described and analyzed in detail in the next section. In this paper we propose a new organization of Occ that improves the performance of the Count algorithm, taking maximum advantage of the available memory bandwidth. Another approach to reduce the memory usage uses wavelet trees to store the Occ data [15], [16]. These structures are specially space efficient when performing rank queries on large texts based on large alphabets. The efficiency of this solution also depends on the entropy of the text. A rank query on an alphabet of σ symbols is done by log 2 (σ) binary rank queries. Each binary query calculates several indexes, accesses several memory locations, and performs some arithmetic operations.
In the DNA context, the alphabet is very small and the text is relatively small and with high entropy. Therefore, optimized versions of the Occ matrix fit in contemporary memory systems, so as the advantage of using wavelet trees regarding the space usage is not so relevant. However, its computational cost is much higher than those implementations based on a table-based Occ.
In the experimental evaluation section we compare our proposal with an FM-index implementation based on wavelet trees using the sdsl-lite library [17].

ANALYSIS OF EXACT MATCHING
This section presents a computational analysis, in terms of memory and performance, of the Count algorithm for different versions of the FM-index proposed in the literature using a table-based Occ structure. This analysis serves to introduce the main characteristics of the searching algorithm and to justify the our proposal.

Full FM-index
The full FM-index assumes that Occ is a pre-computed full look-up table, that is, it stores counts for each possible symbol and index. It can be used to quickly locate the occurrences of a pattern (query) Q[1.
.n], being p n. An exact matching algorithm using FM-index has been presented in [2]. Fig. 1 illustrates the Count step of this algorithm, called backward search (BS). At the end of the algorithm, the sp and ep variables contain the start and the end indices in the suffix array of T that contains Q as a prefix, respectively. The main operation in the search algorithm is a Lastto-First Mapping (LFM), which is performed by calling the function LF (), defined as follows: where i is the index of the loop and u is either sp or ep. Each iteration of the loop 3−6 in the BS algorithm accesses the Q string and makes two calls to the LF () function, one with sp and the other with ep. Note that in every loop iteration, sp (ep) is updated using the value computed in the previous iteration. That constitutes two dependency chains of calls to LF (), one for sp and the other for ep. We denote these chains LFM-chains (see Fig. 1).

Sampled FM-index
The full FM-index requires a large amount of memory space (see (1)) but the BS algorithm exhibits low computing cost. The sampled FM-index is a variant of the full version that introduces a trade-off between memory footprint and computing cost [2], [13].
The storage requirements can be reduced by replacing the Occ structure with a smaller one (rOcc). rOcc stores one column out of every d columns of Occ, that is, Fig. 2 being q=1+d× (p − 1)/d ≤ p, and occur(s,str) the number of occurrences of the symbol s in the string str.
A way of improving data locality consists of placing next in memory columns of rOcc and the blocks of BW T required to reconstruct Occ. This is accomplished in two steps (see Fig. 2). Firstly, rearranging the BW T text in an array of substrings of d consecutive symbols taken from BW T , called buckets [2]. The new data structure, named bBW T , is representing the symbol v of the bucket u. Secondly, combining both rOcc and bBW T data structures into a new one denoted by SFM (Sampled FM-index). SFM associates the column j from rOcc with the row j from bBW T . Specifically, a SFM row refers to a rOcc column (σ counters) and the bBW T bucket required to reconstruct the discarded Occ counters up to the next rOcc column.
A search algorithm based on the sampled FM-index follows a similar structure as the BS algorithm, but it requires to re-write the calculation of an LFM (see (2)) as follows: Note that each sampled LFM uses rOcc instead of Occ, which is d times smaller, but at the cost of performing more computation (the higher the value of d, the higher the computational cost).

K-step Sampled FM-index
The k-step sampled FM-index searches k symbols in a query in a single step [14]. This version reduces computing cost and improves data locality of the sampled version but increases slightly memory footprint.
To search k symbols in a single query, the original alphabet, Σ, is replaced by the set of k-tuples whose elements come from Σ (permutations with repetition). The new alphabet is denoted Σ k and its size is σ k . This change in the alphabet implies modifications in the sampled FMindex data structure. C is transformed into C k , which is indexed by k-tuples in Σ k (hence, its size is σ k ). rOcc becomes rOcc k , whose first dimension is also indexed by ktuples (thus, its size is σ k × (n+1)/d ). BW T is transformed into BW T k , which is composed of k (n+1)-symbol strings, namely the last k columns of the M matrix (see sec. 2.1.2). These k strings, however, can be encoded as only one (n+1)symbol string, where each symbol is now the concatenation of the k symbols of each row from the M matrix. Similarly to bBW T , BW T k , encoded as a single string of k-tuples of symbols, can be blocked into buckets of size d. This new structure is denoted bBW T k , an array of sub-strings composed by the concatenation of d k-tuples of symbols. As with SFM, the extended data structures, rOcc k and bBW T k , are combined into a new one denoted by SFM k (k-step ∑={A,B,…}, ∑ =σ ∑={A,B,…}, ∑ =σ ∑ 2 ={AA,AB,…}, ∑ 2 =σ 2 The k-step version of the calculation of a LFM (denoted by sLF k ()) is an extension of the single-step version, sLF () (see (4)), but using the extended Σ k alphabet and the extended data structures: C k , rOcc k and bBW T k . A single sLF k () call resolves k LFMs, that is, it is equivalent to k calls to sLF (). A call to sLF k () reads one piece of main memory containing the data to resolve k LFMs, exploiting in this way data locality. Moreover, the computational cost of sLF k () is slightly higher than that of sLF () and therefore the cost per LFM is much lower.

Memory footprint
The most memory consuming data structure is Occ, for full FM-index, and SFM k , for the sampled variants. In the case of DNA, the full Occ matrix is a big data structure (see (1)). The sampled versions, however, reduces largely this size depending on the parameters d (sampling factor) and k (symbols searched in a single step). Specifically, the memory footprint for the SFM k data structure is: where R is the size of the rOcc k entry.
Taking the human genome example (n=3 Gbases, σ=4) and d=64, the memory footprint for SFM k is 1.5 GBs, for k=1, and 4.5 GBs, for k=2, considering that a rOcc entry fits in 32 bits. These sizes are much lower than the footprint of the full Occ structure (48 GBs).
Values of k greater than 2 require a large amount of memory due to the exponential dependency of F p(SFM k ) on the number of steps (k). Taking the same example as above but for k=4, the size of SFM k increases to 51 GBs, greater than the original Occ structure.

Memory access pattern
One of the main performance limitations of the BS algorithm is related to the memory system. When executing this algorithm in an out-of-order processor, two LFM-chains are issued for each backward search query, overlapping their execution. Each of these LFMs obtains the Occ entry address (sp or ep) using the address from the previous iteration. Given how the BS algorithm works, the Occ memory access pattern for an LFM-chain is not predictable and it is distributed along the whole Occ matrix, showing neither spatial nor temporal locality. Hence, accesses to Occ result in a high cache miss rate.
However, computations from both LFM-chains are partially correlated. When a part of the query has already been performed, there are usually few matches in the reference text, and the start and end pointers (sp and ep) may have similar values. In that case, the pair of LFMs executed in a loop iteration likely access two Occ entries that are stored in the same cache block. We have measured the ratio of these cache block correlations for different query lengths and text sizes, assuming 64-byte cache blocks (see Fig. 3). It can be noted a high degree of cache block correlation between the two LF () calls within the same loop iteration.
To summarize, each pair of LFMs in a loop iteration reads two different cache blocks from main memory at the beginning of a query, but they are likely to access only one cache block when the query moves forward. Let α denote the average number of cache blocks read from main memory by each LFM pair in the same loop iteration, that is, being r the probability of sp and ep referring data from the same cache block (see Fig. 3). The value of α depends on factors such as the index size and the query length. For the sampled FM-index variants, an LFM reads one rOcc k entry and a sub-string from bBW T k (see (4) for k=1), both belonging to the same SFM k row. In order to minimize the number of cache blocks read from main memory, the SFM k row has to be properly aligned to cross the minimum number of cache block boundaries. Therefore, either d has a suitable value or the SFM k row has to be padded. Equation (6) has to be adapted for the sampled FM-index versions because d and k appear as new parameters. In particular, the average number of cache blocks read from main memory for each query step (performing 2k LFMs) is: where ∆ dk is the average number of accessed cache blocks for a pointer access (either sp or a ep), and r dk is the probability that both sp and ep refer to elements stored in the same SFM k row. Table 1 shows the footprint of Occ and α for the full FM-index (first row in table), as well as the footprint of SFM k and α dk for the sampled versions, using different values of k and d, and 64-byte cache blocks. r and r dk were obtained experimentally searching 20M sequences of DNA strings with an average length of 200 symbols in a full human genome reference (σ=4 and n=3G). ∆ dk was calculated assuming random accesses to rOcc k . Note that, for k=1 and d=32, the SFM k row size is 24 bytes, which is padded with 8 extra bytes in order to store two complete rows in a single cache block (similar situation for k=2 and d=16). It is worth noting that the huge memory footprint of Occ for the full version may result in frequent TLB misses for most of the modern processors.
Our design goal is to reduce α dk for a given k value. On the one hand, ∆ dk increases with growing values of d, since the SFM k row size increases. On the other hand, the

Search intensity
The arithmetic intensity is the ratio of the number of operations (work) to the amount of data traffic (in bytes) [18]. In the case of the FM-index backward search, we use the number of LFMs performed per transferred byte. Consequently, we name this metric search intensity (SI). For the full FM-index, an LFM pair needs to retrieve, in average, α cache blocks from main memory. Hence, the SI of the BS algorithm for a B-byte cache block is: For the sampled versions, search intensity is calculated in a similar way, but replacing α with α dk , that is: considering that 2k LFMs are searched per query step. Table 1 shows the search intensity for the full and sampled versions of FM-index. Search intensity is maximized for the pairs (k=1,d=192) and (k=2,d=128), obtaining similar values (0.0293 and 0.0290) and slightly better than that for the full version (0.0232). For k=2, SFM k occupies three times more memory than for k=1 (3 GB vs. 1 GB). The impact of searching k symbols in a query step on search intensity is compensated by the increase in the average number of blocks read from memory (α dk ). For example, for k=2 and d=128, the matching algorithm performs 4 LFMs in a query step (two sLF k=2 ()) instead of the 2 LFMs performed with k=1 and d=192. However, each step loads from memory an average of 2.15 cache blocks instead of 1.07, leaving the search intensity almost unchanged.

Throughput
Latency. Considering the full FM-index, the execution of the pair of LFMs issued in a iteration of a search query can be modeled with two logical phases (see Fig. 4). The first phase, MEM, corresponds to the memory operations associated to an LFM, which is mainly due to the access to Occ. The second phase, OP, corresponds to the computing operations of an LFM. Basically, this phase comprises the processing of three memory and one add instructions.
In an out-of-order processor with a non-blocking cache, the access to Occ with ep can be initiated while the cache is still servicing the access to Occ with sp in the same iteration (see Fig. 4). However, the next access to Occ must wait to finish the execution of the corresponding LFM of the previous iteration (due to the data dependence in the LFMchain). As the MEM phase requires typically hundreds of cycles (L mem ), while the OP phase lasts few cycles (L op ), the processor is idle most of the time (idle time in Fig. 4).
Assuming that a single hardware thread per core performing a complete query, the throughput is: where L iter is the latency of the iteration (see Fig. 4), and L LF M is the latency of a complete computation of an LFM.
Bandwidth. Equation (10) determines an upper bound of single-thread throughput in terms of latencies. Memory bandwidth, on the other hand, also limits the maximum achievable throughput. Given the search intensity (SI), the throughput of a core (single thread) can be calculated as: being T h BW system the throughput of the complete system, N cores the number of cores executing independent queries in parallel, and BW system the main memory bandwidth for the complete system.

Sampled versions.
With the sampled FM-index, the throughput upper bounds determined by query latencies and memory bandwidth are calculated by equations (10) and (11), respectively, but replacing SI with SI dk .
Regarding T h L core , the LFM latency (L LF M ) increases compared with the full version because of the larger instruction count in the computing phase (OP). Hence, the maximum throughput per core is lower in the sampled versions. Regarding T h BW core , throughput bound is maximum for k and d values that maximize SI dk (see Table 1).

Optimizing Throughput: Overlapped FM-index
The full and sampled FM-index variants have different characteristics in terms of memory footprint and data locality exploitation. However, their impact on throughput is much less important as the search intensity remains almost unchanged (see Table 1 ).
The exact matching algorithm (BS algorithm) is typically query latency bound, since many cycles are lost waiting for data (idle time in Fig. 4), wasting part of the available memory bandwidth. However, the memory latency responsible of the idle time can be hidden by issuing a given number of different independent queries, that is, by overlapping the memory accesses of several queries (batch or offline processing). This way, the throughput upper limit given query latencies is increased. The high number of queries which are usually involved in solving genome mapping problems makes this approach feasible.

MEM OP LF(Q[2][p-1],ep)
OP The resulting algorithm is shown in Fig. 5 for the full FM-index. The OBS algorithm executes a total of 2N q LFMchains for each iteration of the outer loop 3−9, corresponding to an array of N q different queries that are searched concurrently. Note that after computing the two LFMs required for a given query, two prefetch operations are issued to retrieve from memory the two Occ entries needed for computing the next two LFMs of the same query. The latencies of these memory reads are hidden by computing LFMs from other queries (see Fig. 6). By overlapping independent queries, the processor should be busy most of the time.

LF(Q[Nq][p-2],ep) Liter (rest of iterations) L'op
Assuming a single hardware thread per core, the minimum number of LFMs that must be overlapped in order to nullify the idle time is L iter /L op , where L op is the latency of the fraction of the OP phase that is not overlapped with the same phase of other LFMs (see Fig. 6). So, to reach such situation, N q must take a value such that 2 × N q =L iter /L op . We calculate L op for several implementations of the algorithm and for several processors in Section 5.
Using the above expression, the maximum throughput obtained by a core with the OBS algorithm is: Current architectures support simultaneous multithread- ing (SMT) [19], a technique that allows a single core to execute several interleaved independent execution flows (hardware threads). In this situation, the N q queries can be distributed among the hardware threads of a core.

Approach
In order to increase search intensity to improve bandwidth throughput α dk must be reduced. The upper part of Fig. 7 shows a row of the SFM k data structure, for k=1. All entries accessed in the computation of an LFM are marked in red, according to (4). Note that only a single entry in rOcc is accessed, so that if σ is large enough the substring accessed in bBW T may be stored in a different cache block. This occurs, for example, in the case of DNA (σ=4) and k=2. Since the alphabet size is 16 (σ k ), and each entry in rOcc has a size of 4 bytes, a single rOcc 2 column occupies a complete 64byte cache block (16 counters × 4 bytes/counter). We propose to change the SFM k layout and data codification so as all data needed to compute an LFM is stored in a minimum number of cache blocks.

Description
We denote the new data structure by bvSFM k , called split bit-vector sampled FM-index. Our solution comes from the observation that only one out of the σ k rOcc k entries is read for each LFM computation (see upper part of Fig. 7).
The bvSFM k structure is obtained from SFM k through two transformations: (1) partition each row of SFM k into σ k rows, where each of them consists of a single rOcc k entry combined with the complete bucket. Specifically, the row t of SFM k , that is, (a concatenation of the column t of rOcc k and the bucket t), is transformed into σ k rows of bvSFM k , of the form, (2) compression of each bucket using a bitmap where each symbol is represented by a single bit. This representation is as follows: given the row bvSFM k [(t-1)σ k +i, * ] (1≤ i ≤ σ k ), the corresponding bucket (bBW T k [t]) is replaced by a bitmap of length d, where a symbol in the bucket is represented by a set bit (1) if it is equal to the one associated to the entry rOcc k [i, t], and by a unset bit (0) otherwise. The lower part of Fig. 7 shows, as an example, the σ k rows of bvSFM k for k=1. Note that now, all data required to calculate an LFM (in red) are placed together in memory and in a compact way, minimizing memory bandwidth consumption. The transformation also allows the occur() function in (4) to be simplified. Now it has to count the number of set bits (1) in the accessed bucket, operation that can be efficiently performed by the popcount instruction.

Memory footprint
The new bvSFM k data structure has σ k rows for each row of the original SFM k structure, as shown in Fig. 7. Therefore, the memory footprint of bvSFM k is: where R is the size of the rOcc k entry. Note that the row size does not depend on the alphabet size (σ k ). Table 2 shows the size of a bvSFM k row and of the complete structure for the human genome for k=2 and different values of d. For all the selected d values, a bvSFM k row fits in a cache block. Compared to the corresponding SFM k values (see Table 1), the size of the whole data structure increases. For instance, with d=64, the bvSFM k=2 and SFM k=2 footprints are 12 GB and 4.5 GB, respectively. Current computing systems have enough memory to allocate this up-sized bvSFM k=2 data structures.

Search Intensity and Throughput
To minimize memory traffic, a value must be chosen for d such that a bvSFM k row fits in a single cache block. In addition, these rows must be cache-aligned in order to not splitting a row between two consecutive cache blocks. That means that the cache block size must be an integer multiple of the row size. Otherwise, the rows must be padded accordingly. Table 2 shows α dk and SI dk for different values of the sample distance d, assuming 64-byte cache blocks and k=2. Search intensity is calculated using (9). These values have been obtained in the same way as those of Table 1. Different from the SFM k=2 rows, which require two or more cache blocks to be stored, the rows of bvSFM k=2 fit in a single cache block. As a result, the value of α dk is lower for the bitvector version, and consequently its search intensity is about twice larger than that of SFM k . In addition, the sensitivity of SI dk with d is minimum for bvSFM, as its row size does not depend on d for the selected values.
The throughput upper bounds are calculated by (10) and (11). The proposed split bit-vector version improves SI dk of the exact matching algorithm and, hence, increases the throughput upper bound given by memory bandwidth.  In addition, as with the full and sampled versions, it is feasible to combine this data structure with the overlapping of independent queries (OBS algorithm). The resulting exact matching algorithm also improves the throughput limited by query latencies (see (12)).
The assessment is carried out in three processor architectures from Intel (see relevant features in Table 3). Table 4 shows the average number of instructions in the OP phase (see Fig. 4) of the calculation of an LFM for the different versions of the exact matching algorithm, as well as, the time spent by the processor to execute those instructions. We have analyzed optimized x86 machine codes. The Intel IACA tool [20] was used to analyze the hardware resource occupation in Broadwell and Skylake processors, while a similar analysis was made manually in KNL because the IACA tool does not support this architecture. In all processors, the resource that limits most the computation of the LFMs is the front-end unit, in charge of processing 2 (KNL) or 4 (Broadwell, Skylake) instructions per cycle. As expected, k1d32-SFM, k2d16-SFM and k2d64-bvSFM versions (those with 64-bit buckets) have low instruction counts because the operations in occur(s,str) (see (3)) translate into a few processor instructions if the bucket size is equal or smaller than the processor word (64 bits). Likewise, those versions with larger d values have higher instruction counts because they have to loop through d-symbol buckets.

Random Memory Access Benchmark
In order to have a more accurate value of memory bandwidth when issuing random memory accesses, we have developed a benchmark, called RANDOM, that performs memory operations following a random pattern similar to that in the different versions of FM-index. RANDOM uses C randomly generated linked lists with no access locality.
All the bandwidth tests have been conducted using the maximum number of hardware threads supported by cores and for a different number of linked lists (C). For C values beyond 6, the bandwidth reaches a peak and remains stable (see Fig. 8). For lower values, bandwidth is under the peak value because the memory latency cannot be completely hidden by the prefetch operations.
The maximum bandwidth corresponds to KNL MC-DRAM, able to provide about 219 GB/s. This value is much lower than the peak 400 GB/s reported for the STREAM benchmark [21]. On the other hand, the peak bandwidth provided by the DDR4 DRAM memory is about 68 GB/s (Skylake) and 44 GB/s (Broadwell).
The memory access latencies were also evaluated using the RANDOM benchmark. Three load tests were used: low load, medium load and high load. In the low load test, only one hardware thread in the complete processor executes the benchmark that runs over a single linked list (C=1). All the others threads remain idle. In the medium load test, the maximum number of hardware threads in each processor runs over a single linked list. In the high load test, the maximum number of hardware threads (same as medium load) execute the benchmark. Every thread, except one, runs over several linked list in order to have a high load in the system. The remaining thread runs just over one linked list in order to accurately measure the memory access latency. The number of linked lists was selected to be the one that achieves the best memory bandwidth (see Fig. 8).  Fig. 9 shows the RANDOM latency results for the three processor architectures. It can be noted that the latency increases significantly with the load in the system. This behaviour is expected as, when the amount of simultaneous queries increases, accesses to hardware shared resources are much more likely to conflict. Table 5 shows the throughput upper bounds determined by query latencies (processor computing capacity) and main memory bandwidth for all FM-index versions. These values have been obtained through the expressions (12) for computing time and (11) for memory bandwidth, taking as BW system those values shown in Fig. 8.

Throughput Bounds
As expected, the system throughput bounds imposed by memory bandwidth are much higher in KNL than in Broadwell or Skylake. The reason is that, in KNL, the FMindex structure is stored in its MCDRAM banks which provides a much higher bandwidth than DDR4 DRAM.
KNL cores have a lower computing capacity than Broadwell/Skylake cores but a much higher memory bandwidth. This fact results in more versions of the matching algorithm being compute bound for KNL than for Broadwell/Skylake.
The split bit-vector version performs best, mainly due to its higher search intensity compared to the other versions. Specifically, the memory bandwidth throughput bound for bvSFM is around twice as large as that of the SFM versions.
In summary, the best result for Broadwell is 2.52G LFM/s and it is achieved with the k2d96-bvSFM version, with a memory footprint of 8 GB. For Skylake, the best result is 3.94G LFMs/s, obtained with the same version. Finally, for KNL, this version is compute bound, achieving the best throughput with the k2d64-bvSFM version (12.61G LFM/s), with a memory footprint of 12 GB. In general, the split bitvector data structure strongly improves the throughput for all processor architectures.

Experimental Setup and Methodology
The evaluation was conducted on three different platforms:   We used the Intel C Compiler (ICC version 17.0.4) with common flags, -O3 -qno-opt-prefetch, and architecture dependent flags, -xMIC-AVX512, -xCORE-AVX512, and -xCORE-AVX2, for the KNL, Skylake and Broadwell systems, respectively. Thread-level parallelism has been exploited by using all the available threads in all the physical cores. For the KNL system, all FM-index data structures were placed in the MCDRAM, and it was configured in memory flat mode and in quadrant clustering mode [22]. In addition, we used 1 GiB huge TLB pages to avoid TLB misses. The overlapping factor, N q , was set to 4 for KNL, and to 20 for Broadwell and Skylake, being the minimum number of queries to be overlapped to keep busy the processor for all the versions.
A set of 20M queries generated by the Mason simulation tool [23] was used as input data. These queries (200 symbols in average) have been searched in the human genome reference GRCh38 (3 gigabases). All experiments were conducted after loading the sequences into memory. In general, the experimental values reasonably approximate the theoretical ones. The differences are due to processor features not taken into account in the models, specially, the penalty caused by branch miss-predictions. The actual throughput is about 95% of the theoretical limit for exact matching algorithms with 64-bit buckets, and it drops to around 80% for larger bucket sizes. In these cases, the count of coincidences in a variable number of 64-bit pieces forces to execute more branches, which also have unpredictable behavior because they are dependent on the input data.

Throughput
The branch miss-prediction penalty of those versions with buckets larger than 64 bits adds to the substantial increase in the instruction count (see Table 4). As a result, throughput for versions with large buckets (more than 64 bits) is lower than that of versions with shorter buckets.
The exact matching algorithm based on the proposed split bit-vector data structure outperforms by about 60% and 90% the best of previous implementations, executed in Broadwell and Skylake processors, respectively. In KNL, our proposal outperforms by about 135% the best of previous solutions adapted to this processor. In addition, the best throughput in KNL, obtained for k2d64-bvSFM version, is about 6x and 3x that achieved by Broadwell and Skylake, respectively. This improvement is mainly due to the ultra high-bandwidth provided by the MCDRAM memory.

Roofline Model
The roofline model [18], [24] is a simple and intuitive visual method that provides performance upper bounds for an application running in a given architecture. This model is based on the concept of arithmetic intensity. However, since the backward search algorithm does not perform floatingpoint operations, the search intensity is used instead. Fig. 11 shows the roofline model of different FM-index matching algorithms on the three processor architectures. The model considers the main memory peak bandwidth and the experimental results obtained when performing random memory accesses (described in section 5.2) as the memory bandwidth bounds. This random access bandwidth is, in fact, the hardware resource that really limits the algorithm performance for the best FM-index implementation (k2d64-bvSFM) in all processor architectures. This algorithm version is able to use up to 95% of the peak bandwidth for the KNL processor (with vector extensions, AVX-512, extensively used as intrinsics in the computation of an LFM) and almost all the available bandwidth for the Broadwell and Skylake processors.  Fig. 11. Broadwell, Skylake and KNL roofline models respectively. The NVIDIA Tesla Pascal P100, a modern GPU that includes HBM2 high-bandwidth memory technology, reports a throughput of 2.7G LFMs/s, corresponding to the execution of a NVBIO match function.

Double-level bucket FM-index
rOcc counters can be organized as a two-level structure to reduce memory usage [27]. The BWT of T (L) is divided into s-symbol superbuckets, each one divided into d-symbol buckets. A first-level (L1) of counters stores, for each super-bucket, the number of occurrences of each symbol in Σ k from the beginning of L up to the beginning of the bucket. Similarly, a second-level (L2) of counters stores, for each bucket, the number of occurrences of every symbol from the beginning of its superbucket. Considering 32-bit L1 and 16-bit L2 counters, and the human genome (n=3G, σ=4), L1 counters take up 2.9 MB of memory space (for k=2) and, thus, they can be stored in the last-level cache (LLC) of both Broadwell and Skylake processors (25 and 19.25 MB shared among all cores, respectively), but not in the KNL LLC cache (1 MB private for the two cores in a tile). Our results confirm this analysis. With d=16, this version reaches 1.75G, 3.13G, and 4.32G LFM/s for the Broadwell, Skylake and KNL, respectively. Those values are 89%, 86%, and 37% of the k2d64-bvSFM throughput.
Several works focus on improving the performance of the exact matching algorithm (FM-index) for GPUs, like Chacon et al. [25] and Chen et al. [37]. FM-index is also included in the NVBIO [38] library, developed by Nvidia to speed up bioinformatics using GPUs and CUDA technology.
The most relevant operation in the FM-index backward search algorithm is the rank operation [39]. This operation, together with the select one, has been addressed in numerous papers which focus on optimizing both the memory footprint and the pattern search time [40]. Most of this papers are based on succinct data structures [41], [42], [26] and wavelet trees [43], [44].
Unlike mentioned previous works, our paper focuses on improving the pattern search time for genomic data on CPU, specially those with many cores and high bandwidth memory. Unlike CPUs, GPUs exploit fine-grained massive parallelism, but the performance drops when the control flow diverges or the data access pattern is irregular. However, while improving pattern search time, we increase memory footprint, getting close to other fast exact matching algorithms, like suffix array binary search [45].
Even if these techniques can also be very fast, FM-Index can be used for non-exact matching. Several aligment tools are based on it and could be benefited from the ideas and techniques described in this paper.

CONCLUSIONS
This paper presents a new data layout organization of FM-index that boosts throughput thanks to an increase in the search intensity. Basically, our optimized data structure packs all relevant data needed in a query step within a single cache block, minimizing the memory bandwidth demand.
We have experimentally evaluated an exact search algorithm based on the proposed FM-index structure using three multi-core processors. Our proposal outperforms by about 60%, 90% and 135% the best of previous implementations.
The best performance was obtained in the Intel Xeon Phi (KNL) architecture, mainly because of the high peak random access memory bandwidth. Our implementation is able to obtain a throughput of 12G LFM/s, being about 3x faster than previous GPU implementations and about 4.4x faster than the GPU version implemented in the NVIDIA NVBIO bioinformatics library executed on a Tesla Pascal P100.