ZIP

Last modified by Kashif Iqbal on 2019/03/19 17:22

What is a ZIP file?

ZIP file extension represents archives that can hold one or more files or directories. The archive can have compression applied to the included files in order to reduce the ZIP file size. ZIP file format was made public back in February 1989 by Phil Katz for achieving archiving of files and folders. The format was made part of PKZIP utility, created by PKWARE, Inc. Right after the availability of available specifications, many companies made ZIP file format part of their software utilities including Microsoft (since Windows 7), Apple (Mac OS X ) and many others. 

Brief History

The history fo ZIP file format dates back to the event of lawsuite fielded by System Enhancement Associates (SEA) against PKWARE for using its ARC utility without permissions for its trademark and the copyrights of product's appearance and user interface. Prior to this, Phil Katz, had rewritten SEA's source code and released PKXARC, an ARC extractor, and PKARC, a file compressor, as freeware for MS-DOS based systems. Losing to the lawsuit, PKWARE couldn't use the anything related to ARC anymore. This is where the creation of a new file compression came into being, named as ZIP which was made part of PKZIP utility at PKWARE, Inc.

Katz released the ZIP file format specifications into the public domain, while retaining the proprietary rights over his compression and extraction utility i.e. PKZIP. The ZIP compression system was (and is) able to archive files in a folder by means of a 32-bit cyclic redundancy check (CRC) algorithm to compress file sizes. Unlike ARC, .ZIP folders included a directory file that played the role of a cryptographer's code book, holding the information necessary to render the compressed files.

Supported Compression Methods

As per .ZIP File Format specifications, the following compression methods are supported.

  • Store - implies no compression
  • Shrink
  • Reduction (This implies compression factors ranging from level 1 to level 4)
  • Implode
  • Deflate
  • Deflat64
  • BZIP2
  • LZMA (EFS)
  • WavPack
  • PPMd version I, Rev 1

DEFLATE is the commonly used compression method which is a lossless date compression algorithm that uses a combination of the LZ77 and Huffman coding and is detailed in RFC 1951.

ZIP File Format Specifications

ZIP files have capability to store multiple files using different compression techniques while at the same time supports storing a file without any compression. Each file is stored/compressed individually which helps to extract them, or add new ones, without applying compression or decompression to the entire archive. 

Overall ZIP File Format

Each Zip file is structured in the following manner:

Local File Header 1
File Data 1
Data Descriptor 1
Local File Header 2
File Data 2
Data Descriptor 2
      ....
      ....
Local File Header N
File Data N
Data Descriptor N
Archive Decryption Header
Archive Extra Data Record
Central Directory

ZIP file format uses 32-bit CRC algorithm for archiving purpose. In order to render the compressed files, a ZIP archive holds a directory at its end that keeps the entry of the contained files and their location in the archive file. It, thus, plays the role of encoding for encapsulating information necessary to render the compressed files. ZIP readers use the directory to load the list of files without reading the entire ZIP archive. The format keeps dual copies of the directory structure to provide greater protection against loss of data.

Each file in a ZIP archive is represented as an individual entry where each entry consists of a Local File Header followed by the compressed file data.The Directory at the end of archive holds the references to all these file entries. ZIP file readers should avoid reading the local file headers and all sort of file listing should be read from the Directory. This Directory is the only source for valid file entries in the archive as files can be appended towards the end of the archive as well. That is why if a reader reads local headers of a ZIP archive from the beginning, it may read invalid (deleted) entries as well those are not part of the Directory being deleted from archive.

The order of the file entries in the central directory need not coincide with the order of file entries in the archive.

ZIP File Entries

Entries in ZIP file are arranged one after another where each entry consists of:

  • Local File Header
  • Optional Extra Data Fields
  • User data (optionally compressed/optionally encrypted)

The Local File Header of each entry represents information about the file such as comment, file size and file name.  The extra data fields (optional) can accommodate information for extensibility options of the ZIP format.

Local File Header

The Local File Header has specific field structure consisting of multi-byte values. All the values are stored in little-endian byte order where the field length counts the length in bytes. All the structures in a ZIP file use 4-byte signatures  for each file entry. The end of central directory signature is 0x06054b50 and can be distinguished using its own unique signature. Following is the order of information stored in Local File Header.

OffsetBytesDescription
04Local file header signature = 0x04034b50 (read as a little-endian number)
42Version needed to extract (minimum)
62General purpose bit flag
82Compression method
102File last modification time
122File last modification date
144CRC-32
18

4

Compressed size
224Uncompressed size
262File name length (n)
282Extra field length (m)
30nFile Name
30+nmExtra Field

Central Directory File Header

OffsetBytesDescription
04Central directory file header signature = 0x02014b50
42Version made by
62Version needed to extract (minimum)
82General purpose bit flag
102Compression method
122File last modification time
142File last modification date
164CRC-32
204Compressed size
244Uncompressed size
282File name length (n)
302Extra field length (m)
322File comment length (k)
342Disk number where file starts
362Internal file attributes
384External file attributes
424Relative offset of local file header. This is the number of bytes between the start of the first disk on which the file occurs, and the start of the local file header. This allows software reading the central directory to locate the position of the file inside the ZIP file.
46nFile name
46+nmExtra field
46+n+mkFile comment

End of Central Directory Record

OffsetBytesDescription
04End of central directory signature = 0x06054b50
42Number of this disk
62Disk where central directory starts
82Number of central directory records on this disk
102Total number of central directory records
124Size of central directory (bytes)
164Offset of start of central directory, relative to start of archive
202Comment length (n)
22nComment

References

Created by kashifiqbal on 2018/10/17 15:49