Zipofig. A story of ZIP archive recovery

2003 November 30
by Ilya

Download Zipofig (.exe, 15k)

Recently I had a 2.7GB ZIP archive containing 2857 files stored with the “no compression” method. Every ZIP is meant to be unzipped sooner or later, and this one was no exception. To my surprise, unzipping it failed. Attempts with different versions of programs as varied as WinZip, PowerArchiver, WinRar, PKZIP and a few more were unsuccessful too. Some of them reported “a broken zip structure” and crashed with a GPF during the repair attempt; some unzipped the archive partially, pretending there were no more files in it; some unzipped the whole archive but produced a lot of zero-length files instead of normal-sized ones. However, all the files really were in the archive, and I certainly wanted my data back.

A closer look at the ZIP’s binary content revealed some abnormalities in the structure of the archive. It was created with the PKZIP 2.5 CLI for Windows 98/NT. There were no errors in the structure, but starting from a certain offset (0x803e07f0, to be precise) all files were stored in the so-called ‘non-seekable device’ or ‘bit-3 on’ mode, yet with the flags field set to 0.

The generic structure of a regular ZIP file looks as follows:

	[local file header 1]
	[file data 1]
	:
	[local file header n]
	[file data n]
	[central directory entry 1]
	:
	[central directory entry n]
	[end of central directory record]

Zip64 and digital signature records may also exist in an archive, but I’ll skip them here for clarity. Those interested in the whole ZIP format should refer to [2].

A local file header record begins with the signature 0x04034b50 and contains a bit flag field, a crc32 field, a compressed file size field, an original size field and a variable-length file name field, plus a few more fields. A central directory entry begins with the signature 0x02014b50 and is, in effect, a file header again: it contains all the fields of a local file header plus a few extra ones. The end of central directory record begins with the signature 0x06054b50 and contains the total number of entries in the central directory and the offset of the start of the central directory, plus a few more fields.
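
To make the field list above concrete, here is a minimal Python sketch that unpacks the 30-byte fixed part of a local file header. The field order follows the layout described in [2]; the names `LOCAL_HEADER` and `parse_local_header` and the sample entry are made up for illustration and are not taken from Zipofig.

```python
import struct
import zlib

# Fixed 30-byte part of a local file header, little-endian:
# signature, version needed, bit flags, compression method, mod time, mod date,
# crc32, compressed size, original size, file name length, extra field length.
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")
LOCAL_SIG = 0x04034B50

def parse_local_header(buf, pos=0):
    """Unpack the local file header at `pos` (hypothetical helper)."""
    (sig, _version, flags, method, _mtime, _mdate,
     crc32, csize, usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(buf, pos)
    if sig != LOCAL_SIG:
        raise ValueError("no local file header signature at offset %d" % pos)
    name_start = pos + LOCAL_HEADER.size
    return {
        "flags": flags,
        "method": method,
        "crc32": crc32,
        "compressed_size": csize,
        "original_size": usize,
        "name": buf[name_start:name_start + name_len],
        # File data begins after the variable-length name and extra field.
        "data_offset": name_start + name_len + extra_len,
    }

# A hand-built stored entry, invented for this example.
data = b"hello"
sample = LOCAL_HEADER.pack(LOCAL_SIG, 20, 0, 0, 0, 0,
                           zlib.crc32(data), len(data), len(data), 9, 0)
sample += b"hello.txt" + data
info = parse_local_header(sample)
print(info["name"], info["compressed_size"], info["data_offset"])
```

The `data_offset` value is what a reader needs in order to skip from one header to the file content and then on to the next header.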

If the ZIP file was written to standard output or a non-seekable device, an additional data descriptor record is used. A single file entry in this mode looks as follows:

	[local file header]
	[file data]
	[data descriptor]

The data descriptor record has no signature and follows right after the file data. It holds the crc32, compressed file size and original file size fields. In this case the same fields in the local file header should be set to 0 and bit 3 of the bit flag field should be set to 1.
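
As a rough illustration of this layout, the Python sketch below builds one stored entry the way a well-behaved streaming writer would (bit 3 set, zeroed fields in the header, a 12-byte descriptor after the data) and then reads the descriptor back. The helper names and the sample data are assumptions made for the example.

```python
import struct
import zlib

LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")   # fixed 30-byte local file header
DESCRIPTOR = struct.Struct("<III")             # crc32, compressed size, original size
LOCAL_SIG = 0x04034B50
BIT3 = 0x0008                                  # "crc32 and sizes follow the data"

def build_streamed_entry(name, data):
    """Build a stored entry with a trailing data descriptor (hypothetical helper)."""
    header = LOCAL_HEADER.pack(LOCAL_SIG, 20, BIT3, 0, 0, 0,
                               0, 0, 0, len(name), 0)   # crc32 and sizes zeroed
    descriptor = DESCRIPTOR.pack(zlib.crc32(data), len(data), len(data))
    return header + name + data + descriptor

def uses_data_descriptor(buf, pos=0):
    """True if bit 3 is set in the local file header at `pos`."""
    flags = LOCAL_HEADER.unpack_from(buf, pos)[2]
    return bool(flags & BIT3)

name, payload = b"log.txt", b"streamed bytes"
entry = build_streamed_entry(name, payload)

# The descriptor can only be located here because the data length is already known.
data_start = LOCAL_HEADER.size + len(name)
crc, csize, usize = DESCRIPTOR.unpack_from(entry, data_start + len(payload))
print(uses_data_descriptor(entry), csize, usize)
```

Note that the reader has to know the data length before it can find the descriptor, which is exactly what makes this mode awkward to parse from the local headers alone.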

It seems clear enough that PKZIP 2.5 ran into a signed 32-bit overflow during archive creation. At some point it failed to seek the file pointer of the output ZIP archive properly and then switched itself to the ‘non-seekable device’ mode (without setting bit 3 to 1 in the local file headers, for some reason).
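
The overflow hypothesis is easy to check numerically. The Python sketch below only reinterprets the offset quoted earlier as a signed 32-bit integer; how PKZIP 2.5 actually handled file positions internally is an assumption, not something the archive itself proves.

```python
import struct

offset = 0x803E07F0          # first affected offset, as reported above
INT32_MAX = 2**31 - 1        # largest value a signed 32-bit integer can hold

# Reinterpret the same 32 bits as a signed integer.
as_signed = struct.unpack("<i", struct.pack("<I", offset))[0]

print(offset > INT32_MAX)    # does the offset exceed the signed 32-bit range?
print(as_signed)             # what a signed 32-bit file position would hold
```

An offset just past the 2GB mark turns negative when stored in a signed 32-bit variable, and a seek to a negative position would fail.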

The most obvious method to read ZIP files is something like:

open archive
read local file header
while (signature from local file header == 0x04034b50)
    read (file name size from local file header) bytes of file name
    skip next (extra field size + compressed file size) bytes
    read local file header
end while
close archive

As you can see, it will not work properly on files stored in the ‘non-seekable device’ format, where the compressed size field in the local file header is 0. It will simply fail at the ‘skip next’ line by jumping to a wrong place instead of the next local file header. As a result, processing of the archive stops. Unfortunately, this approach is used by some programs, and those that lost part of the ZIP archive demonstrated exactly the described blunder. Other programs use less obvious but still lame methods that somehow combine access to the central directory and the local headers. The results of these methods were GPFs or a bunch of files with a size of 0 bytes, taken from the local file headers.
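
The failure of the naive walk can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the three-entry archive is built by hand, the middle entry imitates my broken files (zero sizes in the header, flags left at 0, a 12-byte descriptor after the data), and `make_entry` and `naive_list` are hypothetical helpers.

```python
import struct
import zlib

LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")   # fixed 30-byte local file header
LOCAL_SIG = 0x04034B50

def make_entry(name, data, zero_sizes=False):
    """Build one stored entry; zero_sizes imitates the broken entries."""
    crc = zlib.crc32(data)
    if zero_sizes:
        header = LOCAL_HEADER.pack(LOCAL_SIG, 20, 0, 0, 0, 0, 0, 0, 0, len(name), 0)
        return header + name + data + struct.pack("<III", crc, len(data), len(data))
    header = LOCAL_HEADER.pack(LOCAL_SIG, 20, 0, 0, 0, 0,
                               crc, len(data), len(data), len(name), 0)
    return header + name + data

def naive_list(buf):
    """Walk local file headers only, trusting the compressed size field."""
    names, pos = [], 0
    while pos + LOCAL_HEADER.size <= len(buf):
        fields = LOCAL_HEADER.unpack_from(buf, pos)
        if fields[0] != LOCAL_SIG:
            break                                   # lost track of the structure
        csize, name_len, extra_len = fields[7], fields[9], fields[10]
        start = pos + LOCAL_HEADER.size
        names.append(buf[start:start + name_len].decode("ascii"))
        pos = start + name_len + extra_len + csize  # the 'skip next' step
    return names

archive = (make_entry(b"a.txt", b"first file") +
           make_entry(b"b.txt", b"second file", zero_sizes=True) +
           make_entry(b"c.txt", b"third file"))
print(naive_list(archive))
```

The walk reports `a.txt` and `b.txt`, then lands in the middle of the second file's data, finds no signature there and stops, so `c.txt` is never seen.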

Facing a “save yourself to be saved” situation, I decided to write my own tool to recover the data. The only problem was to find both the beginning and the size of each file within the archive.

The most obvious solution was to read the local file headers. If the compressed size, the original size and the crc32 fields are all 0, read data until the next local file header signature appears; the file content is that data minus 20 bytes (the size of a zip64 data descriptor record). A clear and elegant solution, but there is a catch: there were nested ZIP archives. In that case a local file header signature can appear before the actual end of the file being processed. This can be avoided with a different method: read 8 bytes at a time with a 1-byte shift until this 8-byte value equals the number of bytes read so far and the 4-byte value after the next 8 bytes is the local file header signature. There is still a chance of an accidental match, but the probability is low. However, I found this method rather slow.
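
One possible reading of this sliding search, as a Python sketch. It assumes the 20-byte descriptor layout mentioned above (4-byte crc32, 8-byte compressed size, 8-byte original size) and interprets “the number of bytes read so far” as the length of the file data preceding the descriptor; `find_data_length` is a hypothetical helper and not the code I eventually used.

```python
import struct
import zlib

LOCAL_SIG = b"PK\x03\x04"              # 0x04034b50 as stored on disk (little-endian)
DESCRIPTOR64 = struct.Struct("<IQQ")   # crc32, compressed size, original size: 20 bytes

def find_data_length(buf, data_start):
    """Try every candidate length n, shifting by one byte at a time."""
    n = 0
    while data_start + n + DESCRIPTOR64.size + 4 <= len(buf):
        _crc, csize, _usize = DESCRIPTOR64.unpack_from(buf, data_start + n)
        after = data_start + n + DESCRIPTOR64.size
        # Accept n only if the stored size agrees with it AND a local file
        # header signature follows the candidate descriptor.
        if csize == n and buf[after:after + 4] == LOCAL_SIG:
            return n
        n += 1
    return None

# Stored data that itself contains a local file header signature (a "nested ZIP").
data = b"outer " + LOCAL_SIG + b" nested"
buf = (data
       + DESCRIPTOR64.pack(zlib.crc32(data), len(data), len(data))
       + LOCAL_SIG + b"rest")
print(find_data_length(buf, 0), len(data))
```

The embedded signature inside the data does not fool the search, because no size field in front of it matches. The byte-by-byte shifting also shows why the method is slow on a multi-gigabyte archive.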

I found it easier to go through the central directory, as it already has all the necessary information. All I needed was to find the beginning of the central directory, not the beginning of each file. Instead of looking for the first occurrence of a central directory entry signature (remember the nested ZIPs?), the necessary value can be taken right from the “end of central directory” record. The final method is much easier to show than to explain. Here it is:

open archive
from the end of archive to its beginning
find a first occurrence of the end of central directory record's signature
if found
    read the end of central directory record
    jump to the central directory first entry's offset, taken from the end of central directory
    read central directory entry
    while (signature from central directory entry==0x02014b50)
        let X= local file header's offset from central directory entry
        let X=X+ local file header size
        let X=X + file name size from central directory entry
        let Y= compressed size from central directory entry
        let S= file name from central directory entry
        if ( (Y==0) and (original size from central directory entry==0) and
        (crc32 from central directory entry==0)) then make directory (S)
        else save (Y) bytes from (X) offset to the (S) file
        read central directory entry
    end while
end if found
close archive

This method will work on any ZIP file, and it is the actual method used in Zipofig, the tool I finally made. No compression was applied to the files in my particular case, which is why no additional effort was required to get the original file content. In other cases it probably makes sense to unpack these Y bytes with the appropriate decompression method before saving them to a file. The decompression method can be selected based on the compression method field of the central directory entry.
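
For readers who prefer something runnable, here is a compact Python sketch of the same idea. It is not the Zipofig source: it works on an in-memory buffer instead of a huge file, handles only stored entries, and, unlike the pseudocode, takes the name and extra field lengths from the local file header when computing the data offset. `recover_stored` is a hypothetical name, and the test archive is produced with Python's `zipfile` module.

```python
import io
import struct
import zipfile

EOCD = struct.Struct("<IHHHHIIH")             # end of central directory, 22 bytes
CDIR = struct.Struct("<IHHHHHHIIIHHHHHII")    # central directory entry, 46 bytes

def recover_stored(buf):
    """Extract stored entries using only central directory information."""
    eocd_pos = buf.rfind(b"PK\x05\x06")       # search from the end of the archive
    if eocd_pos < 0:
        raise ValueError("end of central directory record not found")
    pos = EOCD.unpack_from(buf, eocd_pos)[6]  # offset of the first entry
    files = {}
    while buf[pos:pos + 4] == b"PK\x01\x02":
        f = CDIR.unpack_from(buf, pos)
        method, crc, csize, usize = f[4], f[7], f[8], f[9]
        name_len, extra_len, comment_len, local_off = f[10], f[11], f[12], f[16]
        name = buf[pos + CDIR.size:pos + CDIR.size + name_len].decode("cp437")
        if method != 0:
            raise NotImplementedError("only the 'store' method is handled")
        # Name and extra field lengths as recorded in the local file header.
        local_name_len, local_extra_len = struct.unpack_from("<HH", buf, local_off + 26)
        x = local_off + 30 + local_name_len + local_extra_len
        if csize == 0 and usize == 0 and crc == 0:
            files[name] = None                # treated as a directory
        else:
            files[name] = buf[x:x + csize]
        pos += CDIR.size + name_len + extra_len + comment_len
    return files

src = io.BytesIO()
with zipfile.ZipFile(src, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("docs/", b"")
    zf.writestr("docs/a.txt", b"alpha")
    zf.writestr("b.bin", b"\x00\x01PK\x03\x04nested?")
recovered = recover_stored(src.getvalue())
print(sorted(recovered))
```

Because the sizes come from the central directory, zeroed size fields in the local file headers and signatures embedded in the file data do not matter here.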

Zipofig is a Win32 console utility and is distributed as source code. It was written in, and can be compiled with, Visual C. Please refer to the source code [1] for compatibility notes and compilation instructions. Zipofig has two modes: list the contents of an archive and extract files from an archive. Only one decompression method is supported: store, a.k.a. zero level of compression, a.k.a. no compression. Be my guest to add and implement whatever you like and whatever you need. Zipofig is no pearl of programming art, but I do not care. It works, and it really helped me where others disgracefully failed.

Perhaps you wonder: wasn’t it easier to simply set bit 3 of the bit flag field to 1 in the local file header of each file stored with zero size and crc32 fields? Yes, it would be, but there were reasons not to:

  • there were too many files to fix manually with a hex editor
  • the problem of finding the local file header of each file would still need to be solved anyway
  • there was no guarantee that the programs listed above would be able
    to unpack even a fixed archive, and I had no desire to waste my time on it
  • to feed my ego and prove that I can do better than the authors of brand-name software

And, I should say, the last one is the real reason. Meanwhile I’ve got my data back, which is not a bad bonus at all.

References:

[1] Zipofig C source code.
[2] PKWARE’s Application Note: The .ZIP file format.

7 Comments
2006 November 22
Sergej Howeiler

Link does not work!

2006 November 22
Sergej Howeiler

The source code can be downloaded, but the program cannot.

2006 November 22
Ilya

Yeah, fixed. Thanks, Sergej :)

2007 May 12
spike

Hi, I was looking for an application like yours for a while, but the problem is that I’m on Linux and it’s impossible to compile your code because it uses non-standard functions (like _telli64). Can you try to help me?

2007 May 14

spike, you may try to define _LARGEFILE64_SOURCE and use
__int64 -> off64_t
_lseek64 -> lseek64
_telli64 -> lseek64(file, 0, SEEK_CUR)
_read -> read
_close -> close
Something like this may work.

2007 August 24
Dave

(I’m writing a zip extractor and stumbled over here)

Watch out:

if ( (Y==0) and (original size from central directory entry==0) and (crc32 from central directory entry==0)) then make directory (S)

If you have any zero-length files in the archive, this will create a directory where a file should be.

I’m not sure what the accepted solution is, but I’m currently checking to see if the entry name ends in a slash character (which seems to be a common convention). For unix, the info-zip extensions may store the D bit as well, but I haven’t looked into it that far yet.

2007 August 24

Hi Dave,

Certainly, you are correct that a directory will be created instead of a file. The proper approach in general is to check the external file attribute field, although this field might be set to zero if a file came from standard input (and it was zero in my case). So I found the above solution to be an acceptable compromise. After all, I was not making ultimate software: Zipofig was just a quick hack to recover my data :)
