Education
What Is File Compression and How Does It Work?
A clear explanation of how file compression reduces size — from Huffman coding to ZIP archives — without requiring a computer science degree.
9 min read
The Core Idea Behind Compression
Compression is about finding and eliminating redundancy. Every file — whether it's a document, photo, or video — contains patterns that repeat. Compression algorithms detect these patterns and replace them with shorter representations.
Consider a simple example. The string "AAAAAABBBCC" is 11 characters long. A compression algorithm could encode it as "6A3B2C" — only 6 characters that carry the same information. This technique is called run-length encoding, and it's one of the simplest compression methods.
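Run-length encoding is simple enough to fit in a few lines. Here is a minimal Python sketch of the idea (real RLE implementations also handle digits in the input and runs longer than 9, which this toy version ignores):

```python
def rle_encode(s: str) -> str:
    """Run-length encode: collapse each run of repeated characters
    into a count followed by the character."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                      # extend to the end of the run
        out.append(f"{j - i}{s[i]}")
        i = j
    return "".join(out)

print(rle_encode("AAAAAABBBCC"))  # -> 6A3B2C
```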
Real-world files contain more complex patterns. A photograph might have thousands of pixels with very similar (but not identical) colors. A text document repeats common words and phrases throughout. A spreadsheet might have columns of numbers that follow mathematical patterns. Compression algorithms are designed to find and exploit each of these types of redundancy.
The fundamental limit of compression is called information entropy — the theoretical minimum number of bits required to represent a piece of data without losing any information. Lossless compression approaches this limit but can never beat it. Lossy compression beats it by discarding information the algorithm judges to be expendable.
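Shannon entropy can be computed directly from symbol frequencies. This short sketch measures the entropy of the earlier example string in bits per symbol; multiply by the symbol count to get the theoretical lossless floor:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Average bits per symbol: the theoretical minimum for
    losslessly encoding data with these symbol frequencies."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# 'AAAAAABBBCC' needs about 1.44 bits/symbol, far below the
# 8 bits/symbol of plain ASCII storage.
print(shannon_entropy(b"AAAAAABBBCC"))
```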
Lossless Compression: Preserving Every Bit
Lossless compression reduces file size without losing any data. When you decompress the file, you get back an exact copy of the original, bit for bit. This is essential for text documents, software, databases, and any file where a single changed bit would cause errors.
The most common lossless algorithms:
Huffman coding assigns shorter binary codes to frequently occurring symbols and longer codes to rare symbols. In English text, the letter 'e' might get a 3-bit code while 'z' gets a 12-bit code. The result: common characters take less space, and the overall file shrinks.
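The tree-building step can be sketched with Python's heapq: repeatedly merge the two rarest subtrees until one tree remains, then read codes off the branches. This is a toy version; a real codec would also serialize the tree and pack the bits.

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a Huffman code table: frequent symbols get shorter codes."""
    heap = [[freq, i, sym] for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)        # two lowest-frequency subtrees...
        hi = heapq.heappop(heap)
        heapq.heappush(heap, [lo[0] + hi[0], count, [lo[2], hi[2]]])  # ...merged
        count += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):
            codes[node] = prefix or "0"  # edge case: only one distinct symbol
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("AAAAAABBBCC")
print(codes)  # 'A' (most frequent) gets a 1-bit code; 'B' and 'C' get 2 bits
```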
Lempel-Ziv-Welch (LZW) builds a dictionary of recurring sequences as it reads the file. When it encounters a sequence it's seen before, it replaces it with a short reference to the dictionary entry. GIF and early PDF compression use LZW.
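The dictionary-building loop is the heart of LZW. A compact sketch of the compressor side (decompression rebuilds the same dictionary symmetrically, which this example omits):

```python
def lzw_compress(data: str) -> list:
    """LZW: emit dictionary indices for the longest already-seen sequences,
    learning a new sequence on every miss."""
    dictionary = {chr(i): i for i in range(256)}  # seed with single bytes
    next_code = 256
    current = ""
    output = []
    for ch in data:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate          # keep extending the match
        else:
            output.append(dictionary[current])
            dictionary[candidate] = next_code  # learn the new sequence
            next_code += 1
            current = ch
    if current:
        output.append(dictionary[current])
    return output

# Repeated 'AB' pairs collapse into references to learned entries.
print(lzw_compress("ABABABAB"))  # -> [65, 66, 256, 258, 66]
```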
DEFLATE combines Huffman coding with LZ77 (a sliding-window variant of Lempel-Ziv). It's the algorithm behind ZIP files, gzip, and PNG image compression. DEFLATE is the workhorse of the internet — most web pages, API responses, and software downloads are compressed with it at some point in transit.
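Python's standard-library zlib module exposes DEFLATE directly, so the lossless round trip is easy to demonstrate on repetitive text:

```python
import zlib

text = b"the quick brown fox jumps over the lazy dog " * 100
packed = zlib.compress(text, level=9)    # DEFLATE with zlib framing

print(len(text), "->", len(packed), "bytes")
assert zlib.decompress(packed) == text   # exact copy back, bit for bit
```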
Typical lossless compression ratios: text files compress to 30–40% of their original size. Source code compresses to 20–30%. Already-compressed files (JPEG, MP3, MP4) barely shrink at all — maybe 1–2% — because they've already had their redundancy removed.
Lossy Compression: Sacrificing Perfection for Size
Lossy compression achieves dramatically smaller files by permanently discarding information the algorithm considers imperceptible to humans. You can't recover the discarded data — the decompressed file is an approximation of the original, not an exact copy.
For images, JPEG compression divides the image into 8×8 pixel blocks, converts each block from RGB color to a luminance-chrominance color space (humans are more sensitive to brightness than color), applies a discrete cosine transform (DCT) to represent the block as frequency components, and then aggressively quantizes the high-frequency components (fine details). The result: smooth gradients and large areas of similar color compress beautifully, while sharp edges and fine textures develop visible artifacts at high compression levels.
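To make the transform-and-quantize step concrete, here is a standard-library sketch of an unnormalized 1-D DCT-II on one 8-sample row. Real JPEG uses a scaled 2-D DCT over 8×8 blocks with per-frequency quantization tables; treat this as an illustration of the principle, not the actual codec math. The quantization step size of 8 below is an arbitrary choice for the demo.

```python
import math

def dct8(samples):
    """Unnormalized 1-D DCT-II: express 8 samples as 8 frequency components."""
    n = len(samples)
    return [sum(s * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, s in enumerate(samples))
            for k in range(n)]

# A smooth brightness gradient: almost all energy lands in low frequencies.
row = [10, 12, 14, 16, 18, 20, 22, 24]
coeffs = dct8(row)

# Quantization: divide by a step size and round. Small high-frequency
# coefficients collapse to 0 and cost almost nothing to store.
# This rounding is exactly where JPEG permanently loses information.
step = 8
quantized = [round(c / step) for c in coeffs]
print(quantized)  # -> [17, -3, 0, 0, 0, 0, 0, 0]
```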
For audio, MP3 and AAC use psychoacoustic models that identify sounds humans can't hear — frequencies masked by louder nearby frequencies, sounds below the threshold of audibility, and stereo redundancy. These inaudible components are discarded or encoded at lower precision. A 50 MB WAV file becomes a 5 MB MP3 with quality that most listeners can't distinguish from the original.
For video, H.264 and H.265 exploit temporal redundancy — consecutive frames in a video are mostly identical. The codec encodes the differences between frames rather than each full frame, achieving compression ratios of 100:1 or higher.
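A drastically simplified sketch of temporal redundancy: store the first frame in full, then only per-pixel differences. Real codecs use motion-compensated blocks rather than raw deltas, but the payoff is the same — a static scene produces frames full of zeros, which compress to almost nothing.

```python
def delta_encode(frames):
    """Keep the first frame; encode each later frame as a pixel-wise
    difference from its predecessor."""
    encoded = [frames[0]]
    for prev, curr in zip(frames, frames[1:]):
        encoded.append([c - p for p, c in zip(prev, curr)])
    return encoded

frames = [
    [10, 10, 10, 10],   # frame 1
    [10, 10, 12, 10],   # frame 2: one pixel brightened
    [10, 10, 12, 10],   # frame 3: identical to frame 2
]
print(delta_encode(frames))
# -> [[10, 10, 10, 10], [0, 0, 2, 0], [0, 0, 0, 0]]
```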
Common Compression Formats
ZIP: The universal archive format. Uses DEFLATE compression. Supported natively by Windows, macOS, and Linux without additional software. Reasonable compression ratios and fast compression/decompression speed. The default choice when you need to share compressed files and want maximum compatibility.
GZIP: A single-file compression format using the same DEFLATE algorithm as ZIP. Ubiquitous in web servers (the HTTP Content-Encoding: gzip response header) and Unix/Linux systems. Slightly better compression than ZIP for individual files because it doesn't carry the archive metadata overhead.
7Z: The 7-Zip archive format uses LZMA/LZMA2 compression, which produces smaller files than ZIP — typically 10–20% smaller for general content and up to 40% smaller for certain file types. The trade-off is slower compression speed (decompression is fast). Requires 7-Zip or compatible software to extract.
BZIP2: Uses the Burrows-Wheeler transform followed by Huffman coding. Better compression than GZIP for large files, but significantly slower. Common in Linux software distribution (tar.bz2 files).
ZSTD (Zstandard): A relatively new algorithm by Facebook that achieves compression ratios close to LZMA at speeds close to GZIP. Increasingly used in databases, file systems (Btrfs, ZFS), and content delivery.
RAR: A proprietary archive format with good compression and built-in error recovery. Popular for distributing large files split across multiple volumes. Requires WinRAR or compatible software.
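Three of these algorithm families ship in Python's standard library, so the trade-offs can be compared head-to-head. The sample data below is made-up structured text; exact sizes and rankings will vary with your input, so treat the numbers as illustrative:

```python
import bz2
import lzma
import zlib

# Moderately repetitive, CSV-like sample data (~110 KB).
data = b"".join(f"row,{i},{i % 7},status=OK\n".encode() for i in range(5000))

results = {
    "DEFLATE (zlib)": len(zlib.compress(data, 9)),
    "BZIP2": len(bz2.compress(data, 9)),
    "LZMA (7Z/XZ)": len(lzma.compress(data)),
}
for name, size in results.items():
    print(f"{name:15} {size:7,} bytes ({100 * size / len(data):.1f}% of original)")
```

On data like this, LZMA typically produces the smallest output and DEFLATE the fastest round trip, mirroring the 7Z-versus-ZIP trade-off described above.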
Compression Ratios by File Type
How much compression you'll achieve depends entirely on the file type and its existing level of compression:
Text and documents: 60–80% reduction. Plain text, HTML, XML, CSV, and source code compress extremely well because they contain massive amounts of repeated patterns (common words, whitespace, tags, variable names).
Spreadsheets (XLSX): 10–30% reduction. XLSX is already a ZIP archive containing XML files. Compressing it again yields minimal gains because the content is already compressed.
PDF: 0–70% reduction, depending on content. A text-heavy PDF with uncompressed streams can shrink dramatically. A PDF full of JPEG images is already compressed and won't shrink further.
Images (uncompressed BMP/TIFF): 60–90% reduction with lossless PNG compression, or 90–98% with lossy JPEG compression.
Images (JPEG/PNG/WebP): 0–5% reduction. Already compressed. ZIP won't help.
Audio (WAV): 50–60% reduction with FLAC (lossless) or 80–90% with MP3 (lossy).
Audio (MP3/AAC): 0–2% reduction. Already lossy-compressed.
Video (raw): 95–99% reduction with H.264 encoding.
Video (MP4/MKV): 0–2% reduction. Already compressed.
The pattern: compression works best on raw, unprocessed data. Files that have already been through a compression step — which includes most modern file formats — don't compress further in any meaningful way.
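You can see this pattern in a quick experiment, using random bytes as a stand-in for already-compressed data (to a compressor, both look statistically patternless):

```python
import os
import zlib

repetitive = b"GATTACA" * 10_000          # 70,000 bytes of raw, redundant data
random_ish = os.urandom(70_000)           # proxy for already-compressed data

print(len(zlib.compress(repetitive)))     # a tiny fraction of 70,000 bytes
print(len(zlib.compress(random_ish)))     # roughly 70,000 bytes, or slightly more
```

The second case can even come out marginally larger than the input, because the compressed container adds a few bytes of framing overhead with nothing to offset it.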
When and How to Use Compression
Use compression when you need to:
• Send files via email — email attachment limits make compression essential for anything over 10 MB.
• Store archives of documents, code, or data that you access infrequently. Compressed archives save storage without losing information.
• Transfer files over slow or metered network connections. Compressing before transfer and decompressing after is faster than sending uncompressed data.
• Bundle multiple files for distribution. A single ZIP or 7Z archive is cleaner than a folder of loose files.
Don't bother compressing when:
• The files are already in a compressed format (JPEG, MP4, MP3, XLSX, DOCX). You'll waste CPU time for negligible size reduction.
• Speed is critical and the files are small. Compression adds processing time. For files under 100 KB on a fast network, sending them uncompressed is faster end-to-end.
• You need random access. Compressed archives typically require sequential decompression. If you need to read one file from an archive of thousands, some formats (ZIP) support random access, but others (tar.gz) require decompressing everything.
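ZIP's random access comes from its central directory, which Python's zipfile module uses to pull one member out without touching the rest. A small in-memory sketch (a real archive would live on disk; the file names here are made up):

```python
import io
import zipfile

# Build a small in-memory archive with 100 members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for i in range(100):
        zf.writestr(f"doc_{i}.txt", f"contents of document {i}\n")

# The central directory lets us decompress just one member,
# without extracting the other 99 files.
with zipfile.ZipFile(buf) as zf:
    print(zf.read("doc_42.txt").decode())
```

A tar.gz archive, by contrast, is one continuous gzip stream: reaching the last member means decompressing everything before it.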
For day-to-day file conversion and compression, MagicConverters handles the common cases — PDF compression, image compression, format conversion, and archive creation — without requiring you to understand the underlying algorithms. Upload, convert, download.
Tags: file compression, how compression works, data compression explained, zip compression, lossless compression, file compression types
Related Articles
Pillar guide
Ultimate Guide to Image Formats (Web, Print & Archives)
JPEG vs PNG vs WebP vs AVIF vs SVG — when to use each, how they affect quality and file size, and how to convert safely between formats.
Pillar guide
Image Compression Deep Dive: Quality, Artefacts & Pipelines
How lossy and lossless compression work, what artefacts look like, choosing quality settings, and building a sane resize/compress pipeline for web and product media.
Pillar guide
Web Image Optimization for SEO (Core Web Vitals & Discoverability)
How image choices affect LCP, CLS, and crawling; alt text and captions; structured data; CDNs — tied back to measurable SEO outcomes.