The basic idea is that a file containing n bytes uses n bytes of disk space, plus a bit for some control information: the file's metadata (permissions, timestamps, etc.), and a bit of overhead for the information that the system needs to find where the file is stored. However, there are many complications.
Think of each file as a series of books in a library. Smaller files make up just one volume, but larger files consist of many volumes, like an encyclopedia. In order to be able to locate the files, there is a card catalog which references every volume. Each volume has a bit of overhead due to the covers. If a file is very small, this overhead is relatively large. Also the card catalog itself takes up some room.
Going a bit more technical: in a typical simple filesystem, the space is divided into blocks. A typical block size is 4KiB. Each file takes up an integer number of blocks. Unless the file size is a multiple of the block size, the last block is only partially used. So a 1-byte file and a 4096-byte file both take up 1 block, whereas a 4097-byte file takes up two blocks. You can observe this with the `du` command: if your filesystem has a 4KiB block size, then `du` will report 4KiB for a 1-byte file.
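You can check this yourself. A sketch assuming GNU coreutils (`du --apparent-size`, `-B1`) and a filesystem with the usual 4KiB blocks; the exact allocated figure may differ on your system:

```shell
# Create a 1-byte file in a scratch directory.
dir=$(mktemp -d)
printf 'x' > "$dir/tiny"

# Apparent size: the 1 byte the file actually contains.
du --apparent-size -B1 "$dir/tiny"

# Disk usage: a whole block (typically 4096 bytes).
du -B1 "$dir/tiny"

rm -r "$dir"
```

The first command prints 1; the second prints one full block, however large a block is on your filesystem.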
If a file is large, then additional blocks are needed just to store the list of blocks that make up the file (these are indirect blocks; more sophisticated filesystems may optimize this in the form of extents). Those don't show in the file size as reported by `ls -l`; GNU `du`, which reports disk usage as opposed to size, does account for them.
Some filesystems try to reuse the free space left in the last block to pack several file tails in the same block. Some filesystems (such as ext4 since Linux 3.8) use 0 blocks for tiny files (just a few bytes) that fit entirely in the inode.
Generally, as seen above, the total size reported by `du` is the sum of the sizes of the blocks or extents used by the file.
The size reported by `du` may be smaller if the file is compressed. Unix systems traditionally support a crude form of compression: if a file block contains only null bytes, then instead of storing a block of zeroes, the filesystem can omit that block altogether. A file with omitted blocks like this is called a sparse file. Sparse files are not automatically created when a file contains a large series of null bytes; the application must arrange for the file to become sparse.
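A quick way to see a sparse file in action, assuming GNU `truncate` and `stat` (the allocation figure depends on the filesystem; most Linux filesystems store none of the hole):

```shell
# Extend a file to 100 MiB without writing any data: the whole
# file is one big hole.
truncate -s 100M sparse.img

ls -l sparse.img    # apparent size: 104857600 bytes
du -k sparse.img    # disk usage: (close to) 0, the hole isn't stored
stat -c 'allocated: %b blocks of %B bytes' sparse.img

rm sparse.img
```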
Two major features of very modern filesystems such as zfs and btrfs make the relationship between file size and disk usage significantly more distant: snapshots and deduplication.
Snapshots are a frozen state of the filesystem at a certain date. Filesystems that support this feature can contain multiple snapshots taken at different dates. These snapshots take room, of course. At one extreme, if you delete all the files from the active version of the filesystem, the filesystem won't become empty if there are snapshots remaining.
Any file or block that hasn't changed since a snapshot was taken, or between the times two snapshots were taken, exists identically in the snapshot and in the active version or other snapshot. This is implemented via copy-on-write. In some edge cases, it's possible that deleting a file on a full filesystem will fail due to insufficient available space, because removing that file would require making a copy of a block in the directory, and there's no more room for even that one block.
Deduplication is a storage optimization technique that consists of avoiding storing identical blocks. With typical data, looking for duplicates isn't always worth the effort. Both zfs and btrfs support deduplication as an optional feature.
Why is the total size reported by `du` different from the sum of the file sizes?
As we've seen above, the size reported by `du` for each file is normally the sum of the sizes of the blocks or extents used by the file. Note that by default, `ls -l` lists sizes in bytes, but `du` lists sizes in KiB, or in 512-byte units (sectors) on some more traditional systems (`du -k` forces the use of KiB). Most modern unices support `ls -lh` and `du -h` to use “human-readable” numbers with K, M, G, etc. suffixes (for KiB, MiB, GiB) as appropriate.
When you run `du` on a directory, it sums up the disk usage of all the files in the directory tree, including the directories themselves. A directory contains data (the names of the files, and a pointer to where each file's metadata is), so it needs a bit of storage space. A small directory will take up one block; a larger directory will require more blocks. The amount of storage used by a directory sometimes depends not only on the files it contains but also on the order in which they were inserted and in which some files are removed (with some filesystems, this can leave holes, a compromise between disk space and performance), but the difference will be tiny (an extra block here and there). When you run `ls -ld /some/directory`, the directory's size is listed. (Note that the “total NNN” line at the top of the output from `ls -l` is an unrelated number; it's the sum of the sizes in blocks of the listed items, expressed in KiB or sectors.)
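For example (the size column for a fresh directory is filesystem-dependent; 4096 is typical on ext4):

```shell
mkdir example.d
ls -ld example.d   # the size column is the directory's own size
du -sk example.d   # the directory's own disk usage
rmdir example.d
```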
Keep in mind that `du` includes dot files, which `ls` doesn't show unless you use the `-a` or `-A` option. Conversely, `du` may report less than the expected sum. This happens if there are hard links inside the directory tree: `du` counts each file only once.
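This is easy to check; a sketch assuming GNU tools (the exact totals vary with the block size):

```shell
dir=$(mktemp -d)

# One 1 MiB file, reachable under two names (hard links).
head -c 1048576 /dev/urandom > "$dir/a"
ln "$dir/a" "$dir/b"

ls -l "$dir"    # both a and b show a size of 1048576 bytes
du -sk "$dir"   # but roughly 1024 KiB: the file is counted once

rm -r "$dir"
```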
On some filesystems, such as ZFS on Linux, `du` does not report the full disk space occupied by extended attributes of a file.
Beware that if there are mount points under a directory, `du` will count all the files on these mount points as well, unless given the `-x` option. So if for instance you want the total size of the files in your root filesystem, run `du -x /`, not `du /`.
If a filesystem is mounted on a non-empty directory, the files in that directory are hidden by the mounted filesystem. They still occupy their space, but `du` won't find them.
When a file is deleted, this only removes the directory entry, not necessarily the file itself. Two conditions are necessary in order to actually delete a file and thus reclaim its disk space: the file's link count must drop to zero (i.e. all hard links to it must be removed), and no process may still have the file open.
The output of `lsof` on a mount point includes the processes that have a file open on that filesystem, even if the file is deleted.
A deleted file can also keep using space if it is the backing file of a loop device. Running `losetup -a` (as root) can tell you which loop devices are currently set up and on what file. The loop device must be destroyed (with `losetup -d`) before the disk space can be reclaimed.
If you delete a file in some file managers or GUI environments, it may be put into a trash area where it can be undeleted. As long as the file can be undeleted, its space is still consumed.
A typical filesystem contains:

- blocks containing file and directory data (including indirect blocks);
- filesystem metadata, such as superblocks, inode tables and the journal;
- free blocks.

Only the first kind is reported by `du`. When it comes to `df`, what goes into the “used”, “available” and total columns depends on the filesystem (of course used blocks (including indirect ones) are always in the “used” column, and unused blocks are always in the “available” column).
Filesystems in the ext2/ext3/ext4 family reserve 5% of the space for the root user. This is useful on the root filesystem, to keep the system going if it fills up (in particular for logging, and to let the system administrator store a bit of data while fixing the problem). Even for data partitions such as `/home`, keeping that reserved space is useful because an almost-full filesystem is prone to fragmentation. Linux tries to avoid fragmentation (which slows down file access, especially on rotating mechanical devices such as hard disks) by pre-allocating many consecutive blocks when a file is being written, but if there are not many consecutive blocks left, that can't work.
Traditional filesystems, up to and including ext4 but not btrfs, reserve a fixed number of inodes when the filesystem is created. This significantly simplifies the design of the filesystem, but has the downside that the number of inodes needs to be sized properly: with too many inodes, space is wasted; with too few inodes, the filesystem may run out of inodes before running out of space. The command `df -i` reports how many inodes are in use and how many are available (filesystems where the concept is not applicable may report 0). Running `tune2fs -l` on the volume containing an ext2/ext3/ext4 filesystem reports some statistics, including the total number and the number of free inodes and blocks.
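For example, to see inode usage on the filesystem holding the root directory (the column layout below is GNU `df`'s; other implementations differ, and the numbers are of course system-specific):

```shell
# Inode usage for the filesystem containing /.
# GNU df columns: Filesystem, Inodes, IUsed, IFree, IUse%, Mounted on.
df -i /
```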
If a filesystem is mounted over the network (NFS, Samba, etc.) and the server exports only a portion of that filesystem (e.g. the server has a `/home` filesystem, and exports one of its subdirectories), then `df` on a client reflects the data for the whole filesystem, not just for the part that is exported and mounted on the client.
As we've seen above, the total size reported by `df` does not always take all the control data of the filesystem into account. Use filesystem-specific tools to get the exact size of the filesystem if needed. For example, with ext2/ext3/ext4, run `tune2fs -l` and multiply the block size by the block count.
When you create a filesystem, it normally fills up the available space on the enclosing partition or volume. Sometimes you might end up with a smaller filesystem when you've been moving filesystems around or resizing volumes.
`lsblk` presents a nice overview of the available storage volumes. For additional information, or if you don't have `lsblk`, use specialized volume management or partitioning tools to check what partitions you have. On Linux, there's `pvs` for LVM, `fdisk` for traditional PC-style (“MBR”) partitions (as well as GPT on recent systems), `gdisk` for GPT partitions, `disklabel` for BSD disklabels, Parted, etc. Under Linux, `cat /proc/partitions` gives a quick summary. Typical installations have at least two partitions or volumes used by the operating system: a filesystem (sometimes more), and a swap volume.
Finally, note that most computer programs use units based on powers of 1024 = 2¹⁰ (because programmers love binary and powers of 2). So 1 kB = 1024 B, 1 MB = 1048576 B, 1 GB = 1073741824 B, 1 TB = 1099511627776 B, … Officially, these units are known as kibibyte (KiB), mebibyte (MiB), etc., but most software just reports k or kB, M or MB, etc. On the other hand, hard disk manufacturers systematically use metric (1000-based) units. So that 1 TB drive is only 931 GiB or 0.909 TiB.
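The conversion is easy to check with shell arithmetic (integer division, so the results are truncated):

```shell
# A "1 TB" drive as sold holds 10^12 bytes.
echo $(( 1000000000000 / 1073741824 ))        # / 2^30 → 931 GiB
# Multiply by 1000 before dividing by 2^40 to keep three digits.
echo $(( 1000000000000000 / 1099511627776 ))  # → 909, i.e. 0.909 TiB
```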