【Django】Django ORM - 性能优化

Posted by 西维蜀黍 on 2019-11-17, Last Modified on 2021-09-21

你可以使用 python manage.py shell 来让 ORM 执行查询以玩耍:

$ python manage.py shell
>>> from core import models
>>> a = models.User.objects[0:5]
...

如果你使用 MySQL,还可以结合General Log来实时查看 MySQL的 SQL 执行日志以进行验证(Visit 【MySQL】日志记录 for more details)。

最重要的原则: Work at the appropriate level

意思就是说要在对应的 level (M V C) 做对应的事。e.g. 如果计算 court, 在最低的数据库 level 里是最快的(如果只需要知道此记录是否存在的话,用 exists() 会更快)。 但要 注意 : queryset 是 lazy 的,所以有时候在 higher level (例如模板)里控制 queryset 是否真的执行,说不定会更高效。

下面这段代码很好的解释了不同 level 的意思:

# QuerySet operation on the database
# fast, because that's what databases are good at
my_bicycles.count()

# counting Python objects
# slower, because it requires a database query anyway, and processing
# of the Python objects
len(my_bicycles)

# Django template filter
# slower still, because it will have to count them in Python anyway,
# and because of template language overheads
\{\{ my_bicycles|length \}\}

Lazy 的 QuerySets

Background

这段代码看上去对数据库进行了三次查找,但其实只有在代码执行 print(q) 时,才真正向数据库请求执行 SQL 语句:

>>> q = Entry.objects.filter(headline__startswith="What")
>>> q = q.filter(pub_date__lte=datetime.date.today())
>>> q = q.exclude(body_text__icontains="food")
>>> print(q)

# ps. 上边的这种多条件查询, 官方推荐这种写法:
Entry.objects.filter(
    headline__startswith='What'
).exclude(
    pub_date__gte=datetime.date.today()
).filter(
    pub_date__gte=datetime(2005, 1, 30)
)

Internally, a QuerySet can be constructed, filtered, sliced, and generally passed around without actually hitting the database. No database activity actually occurs until you do something to evaluate the queryset.

Situations to let queryset to be evaluated

那么问题来了 , 既然 queryset 是 lazy 的,queryset 什么时候会被 evaluate 呢(使 ORM 向 DB 真正请求执行SQL)?

  1. Iteration,即对 Queryset 进行遍历循环的操作。

    for e in Entry.objects.all():
        print(e.headline)
    
  2. slicing:即 Entry.objects.all()[:5], 获取 queryset 中的前五个对象,相当于 sql 中的 LIMIT 5

    • 值得注意的是,Slicing an unevaluated QuerySet usually returns another unevaluated QuerySet, but Django will execute the database query if you use the “step” parameter of slice syntax, and will return a list. Slicing a QuerySet that has been evaluated also returns a list.
  3. picling/caching。

  4. repr/str:A QuerySet is evaluated when you call repr() on it. This is for convenience in the Python interactive interpreter, so you can immediately see your results when using the API interactively.

  5. len:A QuerySet is evaluated when you call len() on it. This, as you might expect, returns the length of the result list.

    • If you only need to determine the number of records in the set (and don’t need the actual objects), it’s much more efficient to handle a count at the database level using SQL’s SELECT COUNT(*). Django provides a count() method for precisely this reason.
  6. list():Force evaluation of a QuerySet by calling list() on it. For example:

    entry_list = list(Entry.objects.all())
    
  7. bool():Testing a QuerySet in a boolean context, such as using bool(), or, and or an if statement, will cause the query to be executed. If there is at least one result, the QuerySet is True, otherwise False. For example:

    if Entry.objects.filter(headline="Test"):
       print("There is at least one Entry with the headline Test")
    

以上的情况一旦发生,就会真正执行 SQL 来查询数据库,同时生成 cache (生成的 cache 就存在这个 queryset 对象之内的)。

注意,只有以上这 7 种情况,才会生成 cache。

**举个栗子: **

>>> queryset = Entry.objects.all()
>>> print([p.headline for p in queryset]) # Evaluate the query set.
>>> print([p.pub_date for p in queryset]) # Re-use the cache from the evaluation, rather than evaluate the queery set again.

注意!不会 cache 的情况: Specifically, this means that limiting the queryset using an array slice or an index will not populate the cache. 意思就是说 queryset [5] 和 queryset [:5] 是不会生成 cache 的。还有 exists()iterator() 这样的也不会生成 cache。

举个栗子:

>>> queryset = Entry.objects.all()
>>> print queryset[5] # Queries the database
>>> print queryset[5] # Queries the database again

>>> queryset = Entry.objects.all()
>>> [entry for entry in queryset] # Queries the database
>>> print queryset[5] # Uses cache
>>> print queryset[5] # Uses cache

Caching and QuerySets

Each QuerySet contains a cache to minimize database access. Understanding how it works will allow you to write the most efficient code.

In a newly created QuerySet, the cache is empty. The first time a QuerySet is evaluated – and, hence, a database query happens – Django saves the query results in the QuerySet’s cache and returns the results that have been explicitly requested (e.g., the next element, if the QuerySet is being iterated over). Subsequent evaluations of the QuerySet reuse the cached results.

Keep this caching behavior in mind, because it may bite you if you don’t use your QuerySets correctly. For example, the following will create two QuerySets, evaluate them, and throw them away:

>>> print([e.headline for e in Entry.objects.all()])
>>> print([e.pub_date for e in Entry.objects.all()])

That means the same database query will be executed twice, effectively doubling your database load. Also, there’s a possibility the two lists may not include the same database records, because an Entry may have been added or deleted in the split second between the two requests.

To avoid this problem, simply save the QuerySet and reuse it:

>>> queryset = Entry.objects.all()
>>> print([p.headline for p in queryset]) # Evaluate the query set.
>>> print([p.pub_date for p in queryset]) # Re-use the cache from the evaluation.

When QuerySets are not cached

Querysets do not always cache their results. When evaluating only part of the queryset, the cache is checked, but if it is not populated then the items returned by the subsequent query are not cached. Specifically, this means that limiting the queryset using an array slice or an index will not populate the cache.

For example, repeatedly getting a certain index in a queryset object will query the database each time:

>>> queryset = Entry.objects.all()
>>> print(queryset[5]) # Queries the database
>>> print(queryset[5]) # Queries the database again

However, if the entire queryset has already been evaluated, the cache will be checked instead:

>>> queryset = Entry.objects.all()
>>> [entry for entry in queryset] # Queries the database
>>> print(queryset[5]) # Uses cache
>>> print(queryset[5]) # Uses cache

Here are some examples of other actions that will result in the entire queryset being evaluated and therefore populate the cache:

>>> [entry for entry in queryset]
>>> bool(queryset)
>>> entry in queryset
>>> list(queryset)

一些结论

1 显式地声明你需要的字段

>>> a = models.User.objects.all()
>>> a[0].id
1

事实上,当执行到 a[0].id 时,以下 SQL 会被执行:

SELECT `core_user`.`id`, `core_user`.`name`, `core_user`.`password`, `core_user`.`salt`, `core_user`.`join_time`, `core_user`.`last_login_time`, `core_user`.`type` FROM `core_user` LIMIT 1

因此,使用 values_list() 或者 values()显式地指定你需要的字段。

假设只需要 id 字段的话,可以用 values_list ('id', flat=True)

2 当判断某条数据是否存在时,尽量使用 exists()

当执行以下代码时:

>>>test = models.User.objects.filter(id=1).exists()

以下SQL会被立刻触发执行:

SELECT (1) AS `a` FROM `core_user` WHERE `core_user`.`id` = 1  LIMIT 1

而且,当判断某条数据是否存在时,尽可能的使用primary key,比如:

entry = Entry.objects.get(pk=123)
if some_queryset.filter(pk=entry.pk).exists():
    print("Entry contained in queryset")

上面的方式明显会快于下面的方式:

entry = User(1, ...)
some_queryset = models.User.objects.all()
if entry in some_queryset: # evaluate the sql here
   print("Entry contained in QuerySet")

因为,下面的代码其实会将整个 User 都读取出来:

SELECT `core_user`.`id`, `core_user`.`name`, `core_user`.`password`, `core_user`.`salt`, `core_user`.`join_time`, `core_user`.`last_login_time`, `core_user`.`type` FROM `core_user`

And to find whether a queryset contains any items:

if some_queryset.exists():
    print("There is at least one object in some_queryset")

Which will be faster than:

if some_queryset:
    print("There is at least one object in some_queryset")

… but not by a large degree (hence needing a large queryset for efficiency gains).

因为some_queryset.exists()其实会触发如下 SQL:

SELECT (1) AS `a` FROM `core_user` WHERE `core_user`.`id` = 1  LIMIT 1

而后者会触发一个 SQL,这个 SQL 会将把 filter 中的条件映射到 where 语句中,然后把所有满足条件的数据都从 DB中读出来。

Additionally, if a some_queryset has not yet been evaluated, but you know that it will be at some point, then using some_queryset.exists() will do more overall work (one query for the existence check plus an extra one to later retrieve the results) than simply using bool(some_queryset), which retrieves the results and then checks if any were returned.

3 获取行的数量时,使用QuerySet.count()

If you only want the count, rather than doing len(queryset).

4 直接使用外键

If you only need a foreign key value, use the foreign key value that is already on the object you’ve got, rather than getting the whole related object and taking its primary key. i.e. do:

entry.blog_id

instead of:

entry.blog.id

5 批量插入

When creating objects, where possible, use the bulk_create() method to reduce the number of SQL queries. For example:

Entry.objects.bulk_create([
    Entry(headline='This is a test'),
    Entry(headline='This is only a test'),
])

…is preferable to:

Entry.objects.create(headline='This is a test')
Entry.objects.create(headline='This is only a test')

Note that there are a number of caveats to this method, so make sure it’s appropriate for your use case.

This also applies to ManyToManyFields, so doing:

my_band.members.add(me, my_friend)

…is preferable to:

my_band.members.add(me)
my_band.members.add(my_friend)

…where Bands and Artists have a many-to-many relationship.

一些测试

Test1

>>> a = models.User.objects.values("id")[0:5]
>>> for i in range(0,2):
...     a[i]["id"]

SQL 执行:

2019-11-17T13:15:18.454675Z	   35 Query	SELECT `core_user`.`id` FROM `core_user` LIMIT 1
2019-11-17T13:15:18.456044Z	   35 Query	SELECT `core_user`.`id` FROM `core_user` LIMIT 1 OFFSET 1

结论

  • 执行完 a = models.User.objects.values("id")[0:5] 语句后,并没有触发执行 SQL 语句;
  • 每次进入执行 a[i]["id"]时,都会真正触发执行一次 SQL 语句。

Test2

对于 Entry.objects.all():Entry.objects.filter(id=1)、切片(models.User.objects.all().values("id")[0:1]),只有到对 queryset 对象开始进行遍历的时候,ORM 才会请求 DB 触发执行 SQL。

Test3 - Debug 模式下的特殊情况

在断点调试模式下,由于每次为一个变量赋值后,都要显示出该变量的内容,这就导致QuerySets 的 Lazy Evaluation永远不会被触发。比如,

def test():
    users = models.User.objects.all()
    user = users[0]
    print "test"

按照上面的分析,只有执行到 user = users[0] 时,才会触发执行 SQL 语句。但是,在断点调试模式下,需要显示出 users 变量的值是什么,因此,在执行完 users = models.User.objects.all() 时,SQL 语句就会被触发执行。

以下为 SQL 执行日志:

2019-11-24T04:39:21.763123Z	   46 Connect	root@localhost on task2 using TCP/IP
2019-11-24T04:39:21.773622Z	   46 Query	SET AUTOCOMMIT = 0
2019-11-24T04:39:21.776080Z	   46 Query	SET SQL_AUTO_IS_NULL = 0
2019-11-24T04:39:21.783327Z	   46 Query	SET AUTOCOMMIT = 1
2019-11-24T04:39:21.789287Z	   46 Query	SELECT `core_user`.`id`, `core_user`.`name`, `core_user`.`password`, `core_user`.`salt`, `core_user`.`join_time`, `core_user`.`last_login_time`, `core_user`.`type` FROM `core_user` LIMIT 21

一些分析方法

django-debug-toolbar

原生的 explain 方法

>>> print(Blog.objects.filter(title='My Blog').explain(verbose=True))
Seq Scan on public.blog  (cost=0.00..35.50 rows=10 width=12) (actual time=0.004..0.004 rows=10 loops=1)
  Output: id, title
  Filter: (blog.title = 'My Blog'::bpchar)
Planning time: 0.064 ms
Execution time: 0.058 ms

connection.queries 方法

可以利用这两两句代码来分析你的代码的 sql 执行情况和花费时间:

from django.db import connection
connection.queries
>> [{'sql': 'SELECT polls_polls.id, polls_polls.question, polls_polls.pub_date FROM polls_polls',
     'time': '0.002'}]

from django.db import reset_queries
reset_queries()

Reference