Write-up on The Morning Paper: https://blog.acolyer.org/2018/01/08/the-case-for-learned-index-structures-part-i/

# Overview

Learned Index Structures cleverly combine traditional database-systems research with the latest machine-learning techniques, and may well foreshadow a future direction for systems-software research. This post first gives a brief overview of the paper's core idea, then digs into its details.

“…we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work just provides a glimpse of what might be possible.”

“the engineering effort to build specialized solutions for every use case is usually too high… machine learning opens up the opportunity to learn a model that reflects the patterns and correlations in the data.”

# CPU, GPU, and TPU

## Learned (range) indexes

“At first sight it may be hard to provide the same error guarantees with other types of ML models, but it is actually surprisingly simple. The B-tree only provides this guarantee over the stored data, not for all possible data. For new data, B-Trees need to be re-balanced, or in machine learning terminology re-trained, to still be able to provide the same error guarantees. This significantly simplifies the problem: the min- and max-error is the maximum error of the model over the training (i.e. the stored) data.”
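Below is a minimal sketch of this simplification, assuming an illustrative 1-D linear model fitted with NumPy (all variable names are hypothetical): the error bounds are nothing more than the extreme residuals measured over the stored keys.

```python
import numpy as np

rng = np.random.default_rng(0)
keys = np.sort(rng.lognormal(size=100_000))   # the stored keys
positions = np.arange(len(keys))              # their true offsets

# Fit any model of key -> position; a least-squares line is enough here.
slope, intercept = np.polyfit(keys, positions, deg=1)
residuals = positions - (slope * keys + intercept)

# The guarantee holds over the *stored* data only: the extreme residuals
# bound every lookup of a stored key, like a B-tree's page-level guarantee.
min_err = int(np.floor(residuals.min()))
max_err = int(np.ceil(residuals.max()))
print(f"search window per lookup: [pos{min_err:+d}, pos{max_err:+d}]")
```

Inserting new keys invalidates these bounds, which is exactly the retraining/re-balancing point the quote makes.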

• The overhead of calling into TensorFlow is too high, especially when the call is issued from Python. (This is a purely engineering problem: inference is far simpler than training, so it can be addressed by exporting the trained network weights and re-implementing the inference path in C/C++.)
• In machine-learning terms, a B-tree is severely overfitted to every key, i.e., to every sample point: it demands that its output offset match the true offset exactly, with an error of at most one page. A CDF obtained by training, on the other hand, is a regression that tries to generalize over the distribution of sample offsets; the network can only give a rough position rather than walk the precise "last mile", so it will not fit each individual data point nearly as tightly. That is not what we want in this setting: we have no interest at all in predicting where keys added in the future might land. What we are after is precisely what conventional DNN training strives hardest to avoid: overfitting, throwing away the network's generalization and predictive ability entirely.
• Conventional DNN training generally aims to minimize an average error, for example the sum of squared errors. What we actually want here is to minimize the maximum error (see the sketch after this list).
• B-tree implementations are usually heavily optimized for the cache; standard neural-network implementations pay much less attention to this.
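To make the point about objectives concrete, here is a small sketch comparing the two on the same data. The minimax fit uses a crude grid search over slopes (illustrative only; a real L-infinity fit would use linear programming or a Remez-style algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
keys = np.sort(rng.lognormal(size=10_000))
pos = np.arange(len(keys))

# Least squares: minimizes the *average* (squared) error.
a2, b2 = np.polyfit(keys, pos, deg=1)
max_err_l2 = np.abs(pos - (a2 * keys + b2)).max()

# Crude minimax (L-infinity) fit: for a fixed slope, the intercept that
# minimizes the maximum residual is the midrange of the residuals, so the
# achievable worst case is half the residual spread.
best_worst = np.inf
for a in np.linspace(0.5 * a2, 1.5 * a2, 101):
    r = pos - a * keys
    best_worst = min(best_worst, (r.max() - r.min()) / 2)

print("max |err|, least-squares fit:", max_err_l2)
print("max |err|, minimax fit      :", best_worst)
```

The minimax line may have a worse average error, but a smaller worst-case residual, and for an index the worst case is what sizes the search window.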

## Improving the learned (range) index

### Recursive regression model (RMI)

“…we build a hierarchy of models, where at each stage the model takes the key as an input and based on it picks another model, until the final stage predicts the position.”

“For example, whereas on the top-layer a small ReLU neural net might be the best choice as they are usually able to learn a wide-range of complex data distributions, the models at the bottom of the hierarchy might be thousands of simple linear regression models as they are inexpensive in space and execution time.”
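A minimal sketch of such a hierarchy follows, assuming (for simplicity) plain linear models at both stages rather than a ReLU net at the top; the class and parameter names are illustrative:

```python
import numpy as np

class TwoStageRMI:
    """Stage 1 routes a key to one of `fanout` stage-2 models; each
    stage-2 model predicts the final position for its slice of the keys."""

    def __init__(self, keys, fanout=1000):
        self.keys, self.n, self.fanout = keys, len(keys), fanout
        pos = np.arange(self.n)
        # Stage 1: one linear model over all data, used only for routing.
        self.a1, self.b1 = np.polyfit(keys, pos, deg=1)
        buckets = np.clip((self.a1 * keys + self.b1) * fanout // self.n,
                          0, fanout - 1).astype(int)
        # Stage 2: one tiny linear model per bucket (constant if degenerate).
        self.stage2 = []
        for m in range(fanout):
            sel = buckets == m
            if sel.sum() >= 2:
                self.stage2.append(np.polyfit(keys[sel], pos[sel], deg=1))
            else:
                self.stage2.append((0.0, float(pos[sel][0]) if sel.any() else 0.0))

    def predict(self, key):
        m = int(np.clip((self.a1 * key + self.b1) * self.fanout // self.n,
                        0, self.fanout - 1))
        a2, b2 = self.stage2[m]
        return int(np.clip(a2 * key + b2, 0, self.n - 1))

# Usage: the stage-2 prediction is typically far tighter than a single model's.
rmi = TwoStageRMI(np.sort(np.random.default_rng(0).lognormal(size=100_000)))
print(rmi.predict(rmi.keys[12_345]))   # should land near 12345
```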

The output of the RMI is a page. With no prior knowledge of the data distribution, the traditional way to find the record's final position within that page is either a sequential scan or a binary search. However…

“Yet again, learned indexes might have an advantage here as well: the models actually predict the position of the key, which likely to be much closer to the actual position of the record, while the min- and max-error is presumably larger.”
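Sketched below is one model-biased variant of binary search exploiting this, reusing the min_err/max_err bounds from the earlier sketch: the first probe is the model's prediction instead of the window midpoint, after which it degrades gracefully to ordinary bisection.

```python
def biased_search(keys, key, pred, min_err, max_err):
    """Binary search over [pred+min_err, pred+max_err], probing the
    model's prediction first, since it is usually closest to the target."""
    lo = max(pred + min_err, 0)
    hi = min(pred + max_err, len(keys) - 1)
    probe = min(max(pred, lo), hi)          # first probe = the prediction
    while lo <= hi:
        if keys[probe] == key:
            return probe
        if keys[probe] < key:
            lo = probe + 1
        else:
            hi = probe - 1
        probe = (lo + hi) // 2              # then ordinary bisection
    return None                             # key is not stored
```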

The RMI model beats the B-tree across the board on speed while taking up less space. Other datasets show similar results, which we will not repeat here; interested readers can consult Figures 5 and 6 of the original paper.

## Learned point indexes

“Therefore, most solutions often allocate significantly more memory (i.e., slots) than records to store… For example, it is reported that Google’s Dense-hashmap has a typical overhead of about 78% of memory.”

“Surprisingly, learning the CDF of the key distribution is one potential way to learn a better hash function… we can scale the CDF by the targeted size M of the hash-map and use h(K) = F(K)*M, with key K as our hash-function. If the model F perfectly learned the CDF, no conflicts would exist.”
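A small sketch of this idea, with a piecewise-linear empirical CDF standing in for a trained model F (all names illustrative). Hashing through the CDF spreads skewed keys over the slots far more evenly than naively scaling the key range:

```python
import numpy as np

rng = np.random.default_rng(1)
keys = np.sort(rng.lognormal(size=100_000))
M = len(keys)                    # target hash-map size: one slot per record

# A piecewise-linear empirical CDF over 256 quantile knots stands in for a
# trained model F of the key distribution.
knots = np.quantile(keys, np.linspace(0.0, 1.0, 257))
def cdf_hash(k):
    f = np.interp(k, knots, np.linspace(0.0, 1.0, 257))
    return min(int(f * M), M - 1)           # h(K) = F(K) * M

# Baseline that ignores the distribution: linear scaling of the key range.
def linear_hash(k):
    return min(int((k - keys[0]) / (keys[-1] - keys[0]) * M), M - 1)

def collisions(h):
    occupied = np.zeros(M, dtype=bool)
    hits = 0
    for k in keys:
        i = h(k)
        hits += occupied[i]
        occupied[i] = True
    return hits

print("collisions, h(K) = F(K)*M :", collisions(cdf_hash))
print("collisions, linear scaling:", collisions(linear_hash))
```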

## Learned existence index

“… if there is some structure to determine what is inside versus outside the set, which can be learned, it might be possible to construct more efficient representations.”
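One way such a structure can be built, sketched under simple assumptions (the `model` callable and all names are hypothetical): queries the model confidently scores as members are answered directly, and only the model's false negatives go into a small conventional backup Bloom filter, so the zero-false-negative guarantee is preserved.

```python
import hashlib

class BloomFilter:
    """A tiny conventional Bloom filter backed by a Python int bitset."""
    def __init__(self, n_bits, n_hashes=4):
        self.bits, self.n_bits, self.n_hashes = 0, n_bits, n_hashes
    def _indexes(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits
    def add(self, item):
        for i in self._indexes(item):
            self.bits |= 1 << i
    def __contains__(self, item):
        return all(self.bits >> i & 1 for i in self._indexes(item))

class LearnedBloom:
    """Learned existence index: model first, Bloom filter for its misses."""
    def __init__(self, model, threshold, keys, n_bits):
        self.model, self.threshold = model, threshold
        self.backup = BloomFilter(n_bits)
        for k in keys:                      # only false negatives of the
            if model(k) < threshold:        # model consume filter space
                self.backup.add(k)
    def __contains__(self, k):
        return self.model(k) >= self.threshold or k in self.backup

# Usage with a toy stand-in for a trained classifier over string keys:
keys = [f"user-{i}" for i in range(0, 10_000, 2)]
model = lambda k: 1.0 if int(k.split("-")[1]) % 2 == 0 else 0.0
lbf = LearnedBloom(model, threshold=0.5, keys=keys, n_bits=8 * 1024)
assert all(k in lbf for k in keys)          # false negatives are impossible
```

Keys the model already classifies as present cost no bits; only its false negatives occupy the backup filter, which is why a model that captures the set's structure can shrink the total size.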

# Conclusion

“…we have demonstrated that machine learned models have the potential to provide significant benefits over state-of-the-art database indexes, and we believe this is a fruitful direction for future research.”