【Linux】I/O - Page Cache

Posted by 西维蜀黍 on 2021-11-02, Last Modified on 2023-10-12

Page Cache / Buffer Cache

In computing, a page cache, sometimes also called disk cache, is a transparent cache for the pages originating from a secondary storage device such as a hard disk drive (HDD) or a solid-state drive (SSD). The operating system keeps a page cache in otherwise unused portions of the main memory (RAM), resulting in quicker access to the contents of cached pages and overall performance improvements. A page cache is implemented in kernels that use paging for memory management, and is mostly transparent to applications.

Usually, all physical memory not directly allocated to applications is used by the operating system for the page cache. Since the memory would otherwise be idle and is easily reclaimed when applications request it, there is generally no associated performance penalty and the operating system might even report such memory as “free” or “available”.

Compared to main memory, hard disk drive reads and writes are slow, and random accesses require expensive disk seeks; as a result, larger amounts of main memory bring performance improvements because more data can be cached in memory. Separate disk caching is provided on the hardware side by dedicated RAM or NVRAM chips located either in the disk controller (in which case the cache is integrated into the hard disk drive and is usually called a disk buffer) or in a disk array controller. Such memory should not be confused with the page cache.

Under Linux, the amount of main memory currently used for the page cache is shown in the buff/cache column of free (older versions of free show separate buffers and cached columns):

$ free -h
              total        used        free      shared  buff/cache   available
Mem:          3.8Gi       2.2Gi       142Mi       1.0Mi       1.5Gi       1.4Gi
Swap:         2.0Gi        19Mi       2.0Gi
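The same figure can be read programmatically. The following is a minimal sketch, assuming a Linux system where /proc/meminfo is available; the helper names parse_meminfo and buff_cache_kib are made up for illustration, and the sum Buffers + Cached + SReclaimable only approximates what free prints as buff/cache.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def buff_cache_kib(fields):
    """Approximate the buff/cache column of free, in KiB."""
    return (fields.get("Buffers", 0)
            + fields.get("Cached", 0)
            + fields.get("SReclaimable", 0))


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")  # Linux only
    if meminfo.exists():
        fields = parse_meminfo(meminfo.read_text())
        print(f"Cached:     {fields.get('Cached', 0) // 1024} MiB")
        print(f"buff/cache: {buff_cache_kib(fields) // 1024} MiB")
```

The parser takes the text as an argument rather than opening the file itself, so it can also be fed a saved copy of /proc/meminfo from another machine.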

Memory conservation

Pages in the page cache modified after being brought in are called dirty pages.

Since non-dirty pages in the page cache have identical copies in secondary storage (e.g. hard disk drive or solid-state drive), discarding and reusing their space is much quicker than paging out application memory, and is often preferred over flushing the dirty pages into secondary storage and reusing their space. Executable binaries, such as applications and libraries, are also typically accessed through page cache and mapped to individual process spaces using virtual memory (this is done through the mmap system call on Unix-like operating systems). This not only means that the binary files are shared between separate processes, but also that unused parts of binaries will be flushed out of main memory eventually, leading to memory conservation.
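To make the mmap path concrete, here is a small sketch using Python's standard mmap module. The function name read_via_mmap and its parameters are hypothetical. It maps a file read-only and slices bytes out of the mapping; on Linux the mapped pages are backed by the page cache, which is the mechanism the paragraph above describes for binaries.

```python
import mmap


def read_via_mmap(path, offset=0, length=16):
    """Return `length` bytes at `offset` by mapping the file instead of calling read()."""
    with open(path, "rb") as f:
        # A length of 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        # Note: mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Slicing touches only the pages that cover the requested range.
            return mapped[offset:offset + length]
```

Only the pages that are actually touched need to be resident, which is why rarely used parts of a large mapped file do not have to occupy memory.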

Writing

When data is written, it first goes into the Page Cache and is managed there as dirty pages. Dirty means that the data is stored in the Page Cache but still has to be written to the underlying storage device. The content of these dirty pages is transferred to the underlying storage device periodically, and also when the system calls sync or fsync are used. The underlying storage device may ultimately be a RAID controller or the hard disk itself.
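The amount of dirty data waiting to be written back can be observed on Linux through the Dirty and Writeback fields of /proc/meminfo. The snippet below is a rough sketch under that assumption; the function name dirty_and_writeback_kib is invented for this example.

```python
import re
from pathlib import Path


def dirty_and_writeback_kib(meminfo_text):
    """Return (Dirty, Writeback) in KiB from /proc/meminfo-style text."""
    def field(name):
        # Anchor at line start so "Writeback" does not match "WritebackTmp".
        match = re.search(rf"^{name}:\s+(\d+)", meminfo_text, re.MULTILINE)
        return int(match.group(1)) if match else 0

    return field("Dirty"), field("Writeback")


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")  # Linux only
    if meminfo.exists():
        dirty, writeback = dirty_and_writeback_kib(meminfo.read_text())
        print(f"Dirty: {dirty} KiB, Writeback: {writeback} KiB")
```

Running it before and after a large write, and again after calling sync, should show the Dirty value rise and then fall, although the exact numbers depend on what else the system is doing.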

Reading

File blocks are placed in the Page Cache not only when files are written, but also when they are read. For example, when you read a 100-megabyte file twice in succession, the second access will be quicker, because the file blocks come directly from the Page Cache in memory and do not have to be read from the hard disk again. The following example shows that the Page Cache has grown after a video of a little over 200 megabytes has been played.

user@adminpc:~$ free -m
             total       used       free     shared    buffers     cached
Mem:          3884       1812       2071          0         60       1328
-/+ buffers/cache:        424       3459
Swap:         1956          0       1956
user@adminpc:~$ vlc video.avi
[...]
user@adminpc:~$ free -m
             total       used       free     shared    buffers     cached
Mem:          3884       2056       1827          0         60       1566
-/+ buffers/cache:        429       3454
Swap:         1956          0       1956
user@adminpc:~$
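The effect of reading the same file twice can also be measured directly. This is a minimal sketch; timed_read is a hypothetical helper, and the timings it reports depend heavily on the machine and on whether the file was already cached before the first call.

```python
import time


def timed_read(path, chunk_size=1 << 20):
    """Read the whole file sequentially; return (bytes_read, elapsed_seconds)."""
    start = time.perf_counter()
    total = 0
    # buffering=0 disables Python's own buffer, so every read() goes to the kernel.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            total += len(chunk)
    return total, time.perf_counter() - start


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        for attempt in (1, 2):
            size, seconds = timed_read(sys.argv[1])
            print(f"read {attempt}: {size} bytes in {seconds:.3f} s")
```

On a file that is not yet cached, the second read is normally served from the Page Cache and finishes noticeably faster than the first.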

System Calls

fsync()

  • An application program makes an fsync() call for a specified file. This causes all of the pages that contain modified data for that file to be written to disk. The writing is complete when the fsync() call returns to the program.
  • As well as flushing the file data, fsync() also flushes the metadata information associated with the file.

fsync() transfers (“flushes”) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.
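From Python, fsync() is reachable as os.fsync(). The following sketch shows the usual pattern; write_durably is an illustrative name, not a standard function.

```python
import os


def write_durably(path, data):
    """Write bytes to `path` and block until the kernel reports them flushed."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's userspace buffer into the page cache
        os.fsync(f.fileno())  # flush this file's dirty pages and metadata to the device
    return len(data)
```

The flush() call matters: without it, some of the data may still sit in Python's own buffer, where fsync() cannot see it. For a newly created file, the directory entry may additionally need an fsync() on a descriptor for the containing directory, which this sketch does not do.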

fdatasync()

fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled.

The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.
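As an illustration, appending records to a file is a case where fdatasync() is often sufficient. The sketch below assumes a Unix-like system; append_record is a hypothetical helper, and it falls back to os.fsync() on platforms where Python does not expose os.fdatasync().

```python
import os


def append_record(fd, record):
    """Append `record` (bytes) to the open descriptor and flush the data to disk."""
    view = memoryview(record)
    while view:
        # os.write() may write fewer bytes than requested, so loop until done.
        written = os.write(fd, view)
        view = view[written:]
    # Prefer fdatasync(); fall back to fsync() where fdatasync() is unavailable.
    sync = getattr(os, "fdatasync", os.fsync)
    sync(fd)
```

A caller would open the file once with os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND) and call append_record() for each record.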

sync()

sync() causes all pending modifications to filesystem metadata and cached file data to be written to the underlying filesystems.

  • An application program makes a sync() call. This causes all of the file pages in memory that contain modified data to be scheduled for writing to disk. The writing is not necessarily complete when the sync() call returns to the program.
  • A user can enter the sync command, which in turn issues a sync() call. Again, some of the writes may not be complete when the user is prompted for input (or the next command in a shell script is processed).

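POSIX specifies only that the sync() function shall cause all information in memory that updates file systems to be scheduled for writing out to all file systems. Under POSIX, the writing, although scheduled, is not necessarily complete upon return from sync().

Linux goes further than the standard requires: its sync() waits for the I/O to complete before returning.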
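Python exposes this call as os.sync() on Unix. The sketch below simply measures how long the call blocks; sync_all is a made-up name, and the measured duration says nothing by itself about whether all writes have reached the device on a system that only schedules them.

```python
import os
import time


def sync_all():
    """Call sync() and return how long the call blocked, in seconds."""
    start = time.perf_counter()
    os.sync()  # Unix only; flushes every filesystem, not just one file
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"sync() returned after {sync_all():.3f} s")
```

Because sync() affects the whole system, it can be slow on a machine with a lot of dirty data; fsync() on the specific file is usually the better tool for a single application.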

syncfs()

syncfs() is like sync(), but synchronizes just the filesystem containing the file referred to by the open file descriptor fd.

Summary

In summary:

  1. POSIX fsync(): “please write data for this file to disk”
  2. POSIX sync(): “write all data to disk when you get around to it”
  3. Linux sync(): “write all data to disk (when you get around to it?)”
  4. Linux syncfs(): “write all data for the filesystem associated with this file to disk (when you get around to it?)”
  5. Linux fsync(): “write all data and metadata for this file to disk, and don’t return until you do”

Usages

Database use

In order to provide proper durability, databases need to use some form of sync in order to make sure the information written has made it to non-volatile storage rather than just being stored in a memory-based write cache that would be lost if power failed. PostgreSQL for example may use a variety of different sync calls, including fsync() and fdatasync(), in order for commits to be durable. Unfortunately, for any single client writing a series of records, a rotating hard drive can only commit once per rotation, which makes for at best a few hundred such commits per second. Turning off the fsync requirement can therefore greatly improve commit performance, but at the expense of potentially introducing database corruption after a crash.
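The "once per rotation" limit is simple arithmetic: a drive spinning at N revolutions per minute completes N / 60 rotations per second. The sketch below works this out for a few common spindle speeds; the function name is hypothetical, and the result is an idealised upper bound that ignores seek time and any write cache.

```python
def max_commits_per_second(rpm):
    """Upper bound on commits per second if each commit waits for one full rotation."""
    return rpm / 60.0


if __name__ == "__main__":
    for rpm in (5400, 7200, 15000):
        print(f"{rpm} RPM -> {max_commits_per_second(rpm):.0f} commits/s")
    # prints:
    # 5400 RPM -> 90 commits/s
    # 7200 RPM -> 120 commits/s
    # 15000 RPM -> 250 commits/s
```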

Databases also employ transaction log files (typically much smaller than the main data files) that have information about recent changes, such that changes can be reliably redone in case of crash; then the main data files can be synced less often.
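The idea of a transaction log can be sketched in a few lines. This is a toy illustration, not how PostgreSQL or any particular database implements its log: the class AppendLog, the function replay, and the record layout (4-byte length, 4-byte CRC32, payload) are all assumptions made for the example.

```python
import os
import struct
import zlib


class AppendLog:
    """Toy append-only log: every commit is fsync()ed before it returns."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def commit(self, payload):
        # Hypothetical record layout: little-endian length and CRC32, then the payload.
        record = struct.pack("<II", len(payload), zlib.crc32(payload)) + payload
        view = memoryview(record)
        while view:  # os.write() may be partial, so loop until all bytes are written
            view = view[os.write(self.fd, view):]
        os.fsync(self.fd)  # the commit is durable only after this returns

    def close(self):
        os.close(self.fd)


def replay(path):
    """Return all intact records; stop at the first torn or corrupt one."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            length, crc = struct.unpack("<II", header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break  # incomplete tail, e.g. from a crash in the middle of a write
            records.append(payload)
    return records
```

After a crash, replay() recovers every record whose commit() had returned, so the main data files can be flushed lazily and rebuilt from the log if needed.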

Reference