索引(Index)
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. Indexes are used to quickly locate data without having to search every row in a database table every time a database table is accessed. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
An index is an additional structure that is derived from the primary data. Many databases allow you to add and remove indexes, and this doesn’t affect the contents of the database; it only affects the performance of queries. Maintaining additional structures incurs overhead, especially on writes. For writes, it’s hard to beat the performance of simply appending to a file, because that’s the simplest possible write operation. Any kind of index usually slows down writes, because the index also needs to be updated every time data is written.
This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes. For this reason, databases don’t usually index everything by default, but require you—the application developer or database administrator—to choose indexes manually, using your knowledge of the application’s typical query patterns. You can then choose the indexes that give your application the greatest benefit, without introducing more overhead than necessary.
<<Designing Data Intensive Applications»
当表中有大量记录时,若要对表进行查询,
- 第一种搜索信息方式是全表搜索,是将所有记录一一取出,和查询条件进行一一对比,然后返回满足条件的记录,这样做会消耗大量数据库系统时间,并造成大量磁盘I/O想操作;
- 第二种就是在表中建立索引,然后在索引中找到符合查询条件的索引值,最后通过保存在索引中的ROWID(相当于页码)快速找到表中对应的记录。
索引能够提高 SELECT 查询和 WHERE 子句的速度,但是却降低了包含 UPDATE 语句或 INSERT 语句的数据输入过程的速度。索引的创建与删除不会对表中的数据本身产生影响。
索引目的
索引的目的在于提高查询效率,可以类比字典,如果要查“mysql”这个单词,我们肯定需要定位到m字母,然后从下往下找到y字母,再找到剩下的sql。如果没有索引,那么(在最坏的情况下)你需要把所有单词都看了一遍,才能找到你想要的,因为这时候你要找的单词在文本末尾。
Usage
Support for fast lookup
Most database software includes indexing technology that enables sub-linear time lookup to improve performance, as linear search is inefficient for large databases.
Suppose a database contains N data items and one must be retrieved based on the value of one of the fields. A simple implementation retrieves and examines each item according to the test. If there is only one matching item, this can stop when it finds that single item, but if there are multiple matches, it must test everything. This means that the number of operations in the average case is O(N) or linear time. Since databases may contain many objects, and since lookup is a common operation, it is often desirable to improve performance.
An index is any data structure that improves the performance of lookup. There are many different data structures used for this purpose. There are complex design trade-offs involving lookup performance, index size, and index-update performance. Many index designs exhibit logarithmic (O(log(N))) lookup performance and in some applications it is possible to achieve flat (O(1)) performance.
Policing the database constraints
Indexes are used to police database constraints, such as UNIQUE, EXCLUSION, PRIMARY KEY and FOREIGN KEY. An index may be declared as UNIQUE, which creates an implicit constraint on the underlying table. Database systems usually implicitly create an index on a set of columns declared PRIMARY KEY, and some are capable of using an already-existing index to police this constraint. Many database systems require that both referencing and referenced sets of columns in a FOREIGN KEY constraint are indexed, thus improving performance of inserts, updates and deletes to the tables participating in the constraint.
Some database systems support an EXCLUSION constraint that ensures that, for a newly inserted or updated record, a certain predicate holds for no other record. This can be used to implement a UNIQUE constraint (with equality predicate) or more complex constraints, like ensuring that no overlapping time ranges or no intersecting geometry objects would be stored in the table. An index supporting fast searching for records satisfying the predicate is required to police such a constraints.
索引类型
- 按数据唯一性区分:唯一索引(Unique Index)与非唯一索引
- 按索引的列的类型:主键索引(Primary Index)与辅助索引(Secondary Index)
- 按键列个数区分:单列索引(Single-column Index)与多列索引(Multi-column Index)
- 按存储结构区分:聚集索引(Clustered Index)与非聚集索引(Non-clustered Index)
唯一索引(Unique Index)
**唯一索引(Unique Indexes)**是不允许其中任何两行具有相同索引值的索引。唯一索引不止用于提升查询性能,还用于保证数据完整性。
当现有数据中存在重复的键值时,大多数数据库不允许将新创建的唯一索引与表一起保存。数据库还可能防止添加将在表中创建重复键值的新数据。
例如,如果在 employee 表中职员的姓 (lname) 上创建了唯一索引,则任何两个员工都不能同姓。对某个列建立UNIQUE索引后,插入新记录时,数据库管理系统会自动检查新纪录在该列上是否取了重复值:
CREATE UNIQUE INDEX index_name
on table_name (column_name);
主键索引(Primary Index)与辅助/二级索引(Secondary Index)
主键索引(Primary Index)
主键索引(Primary Index)简称为主索引,数据库表中一列或列组合(字段)的值唯一标识表中的每一行。该列称为表的主键。
在数据库关系图中为表定义主键将自动创建主键索引,主键索引是唯一索引的特定类型。该索引要求主键中的每个值都唯一。当在查询中使用主键索引时,它还允许对数据的快速访问。
对主键构造的索引,对应称为主键索引。
辅助/二级索引(Secondary Index)
辅助索引是相对于主键索引而言的,主键索引和辅助索引(Secondary key)的唯一区别是,主键索引要求索引列的值是唯一的,而辅助索引对索引列的值并没有要求,即索引列的值可以重复。
对非对主键列构造的索引,对应称为辅助/二级索引(Secondary Index)。
单列(Single-column)与多列索引(Multi-column Index)
普通/单列索引(Single-column Index)
最基本的索引类型,没有唯一性之类的限制,单列索引基于单一的字段创建,其基本语法如下所示:
CREATE INDEX index_name
ON table_name (column_name);
多列/联合/组合/复合/混合索引(Multi-column Index, Composite Index)
MySQL can use multiple-column indexes for queries that test all the columns in the index, or queries that test just the first column, the first two columns, the first three columns, and so on. If you specify the columns in the right order in the index definition, a single composite index can speed up several kinds of queries on the same table.
A multiple-column index can be considered a sorted array, the rows of which contain values that are created by concatenating the values of the indexed columns.
Ref
**多列索引(Multi-column Index)**指在表的多个字段组合上创建的索引,只有在查询条件中使用了这些字段的左边字段时,索引才会被使用。
假设有这样一个people表:
CREATE TABLE People(
PeopleId INT NOT NULL,
FirstName NVARCHAR(50) NOT NULL,
LastName NVARCHAR(50) NOT NULL,
Age INT NOT NULL,
PRIMARY KEY (PeopleId)
);
我们创建了一个名为 IX_StuID_StuName
的多列索引,其中包含表中的两列StuId和StuName。
下面是我们插入到这个people表的数据:
表中有四个人的FirstName为“Mikes”(这其中两个人的LastName为Sullivans,另外两个人的LastName为McConnells),有两个人的Age为17岁的人,还有一个名字与众不同的Joe Smith。 这个表的主要用途是根据指定的用户姓、名以及年龄返回相应的peopleid。
例如,我们可能需要查找FirstName为Mike、LastName为Sullivan、年龄17岁的用户的peopleid,SQL语句为
SELECT peopleid FROM people WHERE firstname='Mike' AND lastname='Sullivan' AND age=17;
首先,我们可以考虑在单个列上创建索引,比如firstname、lastname或者age列。
如果我们创建firstname列的索引,数据库将通过这个索引迅速把搜索范围限制到那些firstname=‘Mike’的记录,然后再在这个“中间结果集”上进行其他条件的搜索:它首先排除那些lastname不等于“Sullivan”的记录,然后排除那些age不等于17的记录。当记录满足所有搜索条件之后,数据库就返回最终的搜索结果。
由于建立了firstname列的索引,与执行表的完全扫描相比,MySQL的效率提高了很多,但我们要求MySQL扫描的记录数量仍旧远远超过了实际所需要的。虽然我们可以删除firstname列上的索引,再创建lastname或者age列的索引,但总地看来,不论在哪个列上创建索引搜索效率仍旧相似。
为了提高搜索效率,我们需要考虑运用多列索引。如果为firstname、lastname和age这三个列创建一个多列索引,数据库只需一次检索就能够找出正确的结果!
那么,如果在firstname、lastname、age这三个列上分别创建单列索引,效果是否和创建一个覆盖firstname、lastname、age字段的多列索引一样呢?
答案是否定的,两者完全不同。当我们执行查询的时候,MySQL只能使用一个索引。如果你有三个单列的索引,MySQL会试图选择一个限制最严格的索引。但是,即使是限制最严格的单列索引,它的限制能力也肯定远远低于覆盖firstname、lastname、age这三个列的多列索引。
最左前缀(Leftmost Prefixing)
多列索引还有另外一个优点,它通过称为**最左前缀(Leftmost Prefixing)**的概念体现出来。继续考虑前面的例子,现在我们有一个firstname、lastname、age列上的多列索引,我们称这个索引为fname_lname_age。当搜索条件是以下各种列的组合时,数据库将使用fname_lname_age索引:
- firstname,lastname,age
- firstname,lastname
- firstname
实际上,覆盖firstname、lastname、age列上的多列索引,相当于我们创建了(firstname,lastname,age)、(firstname,lastname)以及(firstname)这些列组合上的索引。因此,下面这些查询在执行时,均能够使用到这个fname_lname_age索引:
SELECT peopleid FROM people WHERE firstname='Mike' AND lastname='Sullivan' AND age='17';
SELECT peopleid FROM people WHERE firstname='Mike' AND lastname='Sullivan';
SELECT peopleid FROM people WHERE firstname='Mike';
下面这些查询在执行时,都不能够使用到这个fname_lname_age索引:
SELECT peopleid FROM people Where lastname='Sullivan';
SELECT peopleid FROM people Where age='17';
SELECT peopleid FROM people Where lastname='Sullivan' AND age='17';
为什么要使用联合索引
减少开销
建一个联合索引(col1,col2,col3),实际相当于建了(col1),(col1,col2),(col1,col2,col3)三个索引。每多一个索引,都会增加写操作的开销和磁盘空间的开销。对于大量数据的表,使用联合索引会大大的减少开销!
覆盖索引(covering index)
同理,当你要select的字段,已经在索引树里面存储,那就不需要再去对数据表进行回表了。
对联合索引(col1,col2,col3),如果有如下的SQL:
select col1,col2,col3 from test where col1=1 and col2=2
那么MySQL可以直接通过遍历索引表就可以取得待查询数据,而无需在遍历索引表后,又再在数据表中进行查询。
这减少了很多的随机I/O操作。减少I/O操作,特别的随机I/O其实是DBA主要的优化策略。所以,在真正的实际应用中,覆盖索引是主要的提升性能的优化手段之一。
效率高
索引列越多,通过索引筛选出的数据越少。有1000W条数据的表,有如下SQL:
select * from table where col1=1 and col2=2 and col3=3
假设假设每个条件可以筛选出10%的数据,如果只有单值索引,那么通过该索引能筛选出1000W*10%=100w条数据,然后再回表从100w条数据中找到符合col2=2 and col3= 3的数据,然后再排序,再分页;如果是联合索引,通过索引筛选出1000w*10%* 10% *10%=1w,效率提升可想而知!
聚集索引(Clustered Index)与非聚集索引(Non-clustered Index)
Refer to https://swsmile.info/post/db-clustered-nonclustered-index/
Reference
- https://en.wikipedia.org/wiki/Database_index
- https://en.wikipedia.org/wiki/Hash_table
- Designing Data Intensive Applications
- 数据库两大神器【索引和锁】 - https://juejin.im/post/5b55b842f265da0f9e589e79
- Mysql 索引精讲 - https://juejin.im/post/5ccfdb05e51d453b7f0a0d4f
- Mysql联合索引最左匹配原则 - https://segmentfault.com/a/1190000015416513