Zipofig. A story of ZIP archive recovery
Recently I had a 2.7GB ZIP archive with 2857 files stored with the “no compression” method. Every ZIP is meant to be unzipped sooner or later, and this one was no exception. To my surprise, unzipping it failed. Attempts with different versions of programs as varied as WinZip, PowerArchiver, WinRar, PKZIP and a few more were unsuccessful too. Some of them reported “a broken zip structure” and crashed with a GPF during the repair attempt; some unzipped the archive only partially, pretending there were no more files in it; some unzipped the whole archive but produced a lot of zero-length files instead of normal-sized ones. However, all the files were indeed in the archive, and I certainly wanted my data back.
A closer look at the ZIP’s binary revealed some abnormalities in the structure of the archive. It had been created with the PKZIP 2.5 CLI for Windows 98/NT. There were no outright errors in the structure, but starting from a certain offset (0x803e07f0, to be precise) all files were stored in the so-called ‘non-seekable device’ or ‘bit-3 on’ mode, yet with the flags field set to 0.
The generic structure of a regular ZIP file looks as follows:
[local file header 1]
[file data 1]
:
[local file header n]
[file data n]
[central directory entry 1]
:
[central directory entry n]
[end of central directory record]
Zip64 and digital signature records may also exist in an archive, but I’ll skip them here for clarity. Anyone interested in the whole ZIP format should refer to [2].
A local file header record begins with the signature 0x04034b50 and contains a bit flag field, a crc32 field, a compressed file size field, an original size field, a variable-length file name field, plus a few more fields. A central directory entry begins with the signature 0x02014b50 and is, in effect, a file header again: it contains all the fields from the local file header plus a few extra ones. The end of central directory record begins with the signature 0x06054b50 and contains the total number of entries in the central directory, the offset of the start of the central directory, plus a few more fields.
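To make the field layout concrete, here is a minimal Python sketch that decodes the fixed part of a local file header. The field order and the 30-byte size follow the format description in [2]; the names `parse_local_header` and `make_sample` are made up for this illustration, and the sample archive is generated with Python's standard `zipfile` module, not with PKZIP.

```python
import io
import struct
import zipfile

LOCAL_SIG = 0x04034B50
LOCAL_FMT = "<IHHHHHIIIHH"  # little-endian, 30 bytes of fixed fields


def parse_local_header(buf, offset=0):
    """Decode the fixed part of a local file header plus the file name."""
    (sig, _version, flags, method, _mtime, _mdate,
     crc32, csize, usize, name_len, extra_len) = struct.unpack_from(LOCAL_FMT, buf, offset)
    if sig != LOCAL_SIG:
        raise ValueError("not a local file header")
    name_start = offset + struct.calcsize(LOCAL_FMT)
    return {
        "flags": flags,
        "method": method,
        "crc32": crc32,
        "compressed_size": csize,
        "original_size": usize,
        "name": buf[name_start:name_start + name_len].decode("cp437"),
        # File data begins after the name and the optional extra field.
        "data_offset": name_start + name_len + extra_len,
    }


def make_sample():
    """Build a tiny stored (uncompressed) archive in memory."""
    mem = io.BytesIO()
    with zipfile.ZipFile(mem, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("hello.txt", b"hello zip")
    return mem.getvalue()


sample = make_sample()
header = parse_local_header(sample)
start = header["data_offset"]
payload = sample[start:start + header["compressed_size"]]
print(header["name"], header["compressed_size"], payload)
```

The central directory entry and the end of central directory record can be decoded the same way with their own signatures and field lists.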
If the output ZIP file is standard output or a non-seekable device, an additional data descriptor record is used. A single file entry in this mode looks as follows:
[local file header]
[file data]
[data descriptor]
The data descriptor record has no signature and follows right after the file data. It holds the crc32, compressed file size and original file size fields. In this case the same fields in the local file header should be set to 0, and bit 3 of the bit flag field should be set to 1.
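A small Python sketch of such an entry may help. It builds one by hand and reads the descriptor back. The helper name `read_data_descriptor` is hypothetical. The check for 0x08074b50 is an addition on my part: the format note in [2] says many tools write that value in front of the descriptor even though the record originally had no signature, so the sketch tolerates both forms.

```python
import struct
import zlib

LOCAL_SIG = 0x04034B50
LOCAL_FMT = "<IHHHHHIIIHH"      # 30 bytes of fixed fields
DESCRIPTOR_SIG = 0x08074B50     # optional marker some tools write first
FLAG_DATA_DESCRIPTOR = 0x0008   # bit 3 of the bit flag field


def read_data_descriptor(buf, offset, zip64=False):
    """Return (crc32, compressed_size, original_size) stored at offset."""
    (first,) = struct.unpack_from("<I", buf, offset)
    if first == DESCRIPTOR_SIG:
        offset += 4
    fmt = "<IQQ" if zip64 else "<III"   # zip64 widens both sizes to 8 bytes
    return struct.unpack_from(fmt, buf, offset)


# Hand-built entry in 'non-seekable device' mode: zeros in the header,
# bit 3 set, real values in the descriptor that trails the data.
name, payload = b"a.txt", b"streamed data"
header = struct.pack(LOCAL_FMT, LOCAL_SIG, 20, FLAG_DATA_DESCRIPTOR,
                     0, 0, 0, 0, 0, 0, len(name), 0)
descriptor = struct.pack("<III", zlib.crc32(payload), len(payload), len(payload))
entry = header + name + payload + descriptor

data_offset = len(header) + len(name)
fields = read_data_descriptor(entry, data_offset + len(payload))
print(fields)
```

Note that the reader has to know where the data ends before it can find the descriptor, which is exactly what makes this mode awkward to parse from the local headers alone.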
It is clear enough that PKZIP 2.5 ran into a signed 32-bit overflow during archive creation: the offset where the trouble starts, 0x803e07f0, lies just past 0x7fffffff, the largest signed 32-bit value. At some point it failed to seek the file pointer of the output ZIP archive properly and then switched itself to the ‘non-seekable device’ mode (for some reason without setting bit 3 to 1 in the local file headers).
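The arithmetic behind that guess can be checked in a few lines of Python. This only shows what the offset looks like when the same 32 bits are read as a signed integer; it says nothing certain about what PKZIP does internally.

```python
import struct

switch_offset = 0x803E07F0   # offset quoted in the post
int32_max = 2**31 - 1        # 0x7fffffff

# Reinterpret the same 32 bits as a signed integer, as a 32-bit C `long` would.
(as_signed,) = struct.unpack("<i", struct.pack("<I", switch_offset))

print(switch_offset > int32_max)  # True: the offset does not fit in a signed 32-bit value
print(as_signed < 0)              # True: read as signed, it becomes negative
```

A seek to a negative absolute position fails, which would be consistent with the archiver concluding that its output could no longer be seeked.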
The most obvious method to read ZIP files is something like:
open archive
read local file header
while (signature from local file header == 0x04034b50)
    read (file name size from local file header) bytes of file name
    skip next (extra field size + compressed file size) bytes
    read local file header
end while
close archive
As you can see, it will not work properly on files stored in the ‘non-seekable device’ format, where the compressed size field in the local file header is 0. It simply fails at the ‘skip next’ line by jumping to the wrong place instead of to the next local file header. As a result, processing of the archive stops. Unfortunately, some programs do use this approach, and those that lost part of the ZIP archive demonstrated exactly the blunder described. Other programs use less obvious but still lame methods that somehow combine access to the central directory and the local headers. These methods resulted in GPFs or in a bunch of files with a size of 0 bytes, taken from the local file headers.
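The failure of the sequential walk is easy to reproduce. The Python sketch below is a toy: `make_entry` and `naive_list` are names invented here, the entries are hand-built with just enough fields for the walker, and the crc32 field is left at 0 because the walker never looks at it.

```python
import struct

LOCAL_SIG = 0x04034B50
LOCAL_FMT = "<IHHHHHIIIHH"  # 30 bytes of fixed fields
HEADER_LEN = struct.calcsize(LOCAL_FMT)


def make_entry(name, payload, zero_sizes=False):
    """Build one stored entry; zero_sizes mimics the damaged local headers."""
    size = 0 if zero_sizes else len(payload)
    header = struct.pack(LOCAL_FMT, LOCAL_SIG, 20, 0, 0, 0, 0, 0,
                         size, size, len(name), 0)
    return header + name + payload


def naive_list(buf):
    """Walk local file headers one after another, trusting their size fields."""
    names, pos = [], 0
    while pos + HEADER_LEN <= len(buf):
        fields = struct.unpack_from(LOCAL_FMT, buf, pos)
        if fields[0] != LOCAL_SIG:
            break                      # landed on something that is not a header
        csize, name_len, extra_len = fields[7], fields[9], fields[10]
        name_start = pos + HEADER_LEN
        names.append(buf[name_start:name_start + name_len].decode("cp437"))
        pos = name_start + name_len + extra_len + csize   # the 'skip next' step
    return names


good = make_entry(b"a.txt", b"AAAA") + make_entry(b"b.txt", b"BBBB")
damaged = make_entry(b"a.txt", b"AAAA", zero_sizes=True) + make_entry(b"b.txt", b"BBBB")

print(naive_list(good))
print(naive_list(damaged))
```

With the zeroed size field the walker skips nothing, lands in the middle of the first file's data, sees no signature there and gives up, so the second file is never listed.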
Dealing with a “save yourself if you want to be saved” situation, I decided to write my own tool to recover the data. The only problem was finding both the beginning and the size of each file within the archive.
The most obvious solution was to read the local file headers. If the compressed size, the original size and the crc32 fields are all 0, read data until the next local file header signature appears; the file content is then that data minus the last 20 bytes (the size of a zip64 data descriptor record). A clear and elegant solution, but there is a catch: the archive contained nested ZIP archives, so a local file header signature can appear before the actual end of the file being processed. This can be avoided with a different method: read an 8-byte value at each successive 1-byte offset until that value equals the number of bytes read so far and the 4-byte value after the next 8 bytes is the local file header signature. There is still a chance of an accidental match, but the probability is low. However, I found this method rather slow.
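Here is a rough Python sketch that combines the two ideas from this paragraph. It is a variant, not the exact byte-by-byte loop described above: it jumps from one signature candidate to the next and accepts a candidate only if the 20-byte descriptor in front of it holds sizes that match the distance covered. The function name `find_stored_length` is hypothetical, accepting a central directory signature as a terminator for the last file is my assumption, and the check only makes sense for stored (uncompressed) data.

```python
import struct
import zlib

LOCAL_SIG = struct.pack("<I", 0x04034B50)
CENTRAL_SIG = struct.pack("<I", 0x02014B50)
DESCRIPTOR_LEN = 20  # crc32 (4) + compressed size (8) + original size (8)


def find_stored_length(buf, data_start):
    """Return the length of stored file data beginning at data_start,
    or None if no signature is preceded by a consistent descriptor."""
    search_from = data_start + DESCRIPTOR_LEN
    while True:
        hits = [p for p in (buf.find(LOCAL_SIG, search_from),
                            buf.find(CENTRAL_SIG, search_from)) if p != -1]
        if not hits:
            return None
        sig_pos = min(hits)
        length = sig_pos - DESCRIPTOR_LEN - data_start
        _crc, csize, usize = struct.unpack_from("<IQQ", buf, sig_pos - DESCRIPTOR_LEN)
        if csize == length == usize:
            return length              # sizes agree with the distance covered
        search_from = sig_pos + 1      # signature inside the data, keep looking


# File data that itself contains a local file header signature (a nested ZIP).
payload = b"x" * 40 + LOCAL_SIG + b"nested zip bytes"
descriptor = struct.pack("<IQQ", zlib.crc32(payload), len(payload), len(payload))
stream = payload + descriptor + LOCAL_SIG + b"(next local file header)"

length = find_stored_length(stream, 0)
print(length, stream[:length] == payload)
```

Comparing the descriptor's crc32 against the candidate data as well would rule out nearly all accidental matches, at the cost of checksumming every candidate.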
I found it easier to go through the central directory, as it already has all the necessary information. All I needed was to find the beginning of the central directory, not the beginning of each file. Instead of looking for the first occurrence of a central directory entry signature (remember the nested ZIPs?), the necessary value can be taken right from the “end of central directory” record. The final method is much easier to show than to explain. Here it is:
open archive
from the end of archive to its beginning
    find the first occurrence of the end of central directory record's signature
if found
    read the end of central directory record
    jump to the central directory first entry's offset, taken from the end of central directory record
    read central directory entry
    while (signature from central directory entry == 0x02014b50)
        let X = local file header's offset from central directory entry
        let X = X + local file header size
        let X = X + file name size from central directory entry
        let Y = compressed size from central directory entry
        let S = file name from central directory entry
        if ((Y == 0) and (original size from central directory entry == 0) and
            (crc32 from central directory entry == 0)) then make directory (S)
        else save (Y) bytes from (X) offset to the (S) file
        read central directory entry
    end while
end if found
close archive
This method should work on any ZIP file whose central directory is intact, and it is the actual method used in Zipofig, the tool I finally made. Note that X as computed above assumes the local file header carries no extra field; if it does, the extra field size from the local file header must be added to X as well. No compression had been applied to the files in my particular case, which is why no additional effort was required to get the original file content. In other cases it probably makes sense to unpack these Y bytes with an appropriate decompression method before saving them to a file. The method can be selected based on the compression method field of the central directory entry.
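For readers who prefer code to pseudocode, here is a minimal Python sketch of the same idea. It is not Zipofig's source. The names `recover` and `make_damaged_sample` are invented for this example, and it makes several assumptions of its own: the whole archive fits in memory (which a 2.7GB one would not), there are no zip64 records, names are decoded as cp437, directories are recognised by a trailing slash, and the name and extra field lengths are read from the local file header itself. It also handles the deflate method to show how the compression method field can select the unpacker.

```python
import io
import struct
import zipfile
import zlib

EOCD_SIG = b"PK\x05\x06"            # 0x06054b50 as it appears on disk
EOCD_FMT = "<IHHHHIIH"              # 22 bytes of fixed fields
CENTRAL_SIG = 0x02014B50
CENTRAL_FMT = "<IHHHHHHIIIHHHHHII"  # 46 bytes of fixed fields
LOCAL_FMT = "<IHHHHHIIIHH"          # 30 bytes of fixed fields


def recover(buf):
    """Return {name: content}, taking all sizes from the central directory."""
    eocd_pos = buf.rfind(EOCD_SIG)  # search from the end of the archive
    if eocd_pos < 0:
        raise ValueError("end of central directory record not found")
    pos = struct.unpack_from(EOCD_FMT, buf, eocd_pos)[6]  # central directory offset
    central_len = struct.calcsize(CENTRAL_FMT)
    files = {}
    while pos + central_len <= len(buf):
        entry = struct.unpack_from(CENTRAL_FMT, buf, pos)
        if entry[0] != CENTRAL_SIG:
            break
        method, csize = entry[4], entry[8]
        name_len, extra_len, comment_len = entry[10:13]
        local_offset = entry[16]
        name = buf[pos + central_len:pos + central_len + name_len].decode("cp437")
        # Name and extra field lengths are taken from the local header itself.
        local = struct.unpack_from(LOCAL_FMT, buf, local_offset)
        data_start = local_offset + struct.calcsize(LOCAL_FMT) + local[9] + local[10]
        raw = buf[data_start:data_start + csize]
        if name.endswith("/"):
            pass                                     # directory entry, no content
        elif method == 0:
            files[name] = raw                        # stored
        elif method == 8:
            files[name] = zlib.decompress(raw, -15)  # raw deflate stream
        else:
            raise NotImplementedError(f"compression method {method}")
        pos += central_len + name_len + extra_len + comment_len
    return files


def make_damaged_sample():
    """Write a small archive, then zero crc32 and sizes in every local header."""
    mem = io.BytesIO()
    with zipfile.ZipFile(mem, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("docs/readme.txt", b"hello")
        zf.writestr("packed.txt", b"abc" * 100, compress_type=zipfile.ZIP_DEFLATED)
    raw = bytearray(mem.getvalue())
    for info in zf.infolist():
        start = info.header_offset + 14      # crc32 field of the local header
        raw[start:start + 12] = bytes(12)    # crc32 + both size fields
    return bytes(raw)


recovered = recover(make_damaged_sample())
print(sorted(recovered))
```

A version meant for multi-gigabyte archives would seek within the file instead of slicing one big buffer, but the order of steps stays the same.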
Zipofig is a Win32 console utility distributed as source code. It was written in, and can be compiled with, Visual C. Please refer to the source code [1] for compatibility notes and compilation instructions. Zipofig has two modes: list the contents of an archive and extract files from an archive. Only one decompression method is supported: store, a.k.a. zero compression level, a.k.a. no compression. Be my guest to add and implement whatever you like and whatever you need. Zipofig is certainly no pearl of the programming art, but I do not care. It works, and it really helped me when the others failed disgracefully.
Perhaps you wonder: wouldn’t it have been easier to simply set bit 3 of the bit flag field to 1 in the local file header of each file stored with zero size and crc32 fields? Yes, it would, but there were reasons not to:
- there were too many files for a manual fix with a hex editor
- the problem of finding the local file header of each file would still need to be solved anyway
- there was no guarantee that the programs listed above would be able to unpack even a fixed archive, and I had no desire to waste my time on it
- to feed my ego and prove that I can do better than the authors of brand-name software
And, I should say, the last one is the real reason. Meanwhile I’ve got my data back, not a bad bonus indeed.
References:
[1] Zipofig C source code.
[2] PKWARE’s Application Note: The .ZIP file format.