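Strange "BadZipfile: Bad CRC-32" problem

This code is a simplification of code in a Django app that receives an uploaded zip file via HTTP multipart POST and does read-only processing of the data inside: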

#!/usr/bin/env python

import csv, sys, StringIO, traceback, zipfile
try:
    import io
except ImportError:
    sys.stderr.write('Could not import the `io` module.\n')

def get_zip_file(filename, method):
    if method == 'direct':
        return zipfile.ZipFile(filename)
    elif method == 'StringIO':
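        data = open(filename, 'rb').read()  # zip data is binary; read it in binary mode
        return zipfile.ZipFile(StringIO.StringIO(data))
    elif method == 'BytesIO':
        data = open(filename, 'rb').read()  # zip data is binary; read it in binary mode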
        return zipfile.ZipFile(io.BytesIO(data))


def process_zip_file(filename, method, open_defaults_file):
    zip_file    = get_zip_file(filename, method)
    items_file  = zip_file.open('items.csv')
    csv_file    = csv.DictReader(items_file)

    try:
        for idx, row in enumerate(csv_file):
            image_filename = row['image1']

            if open_defaults_file:
                z = zip_file.open('defaults.csv')
                z.close()

        sys.stdout.write('Processed %d items.\n' % idx)
    except zipfile.BadZipfile:
        sys.stderr.write('Processing failed on item %d\n\n%s' 
                         % (idx, traceback.format_exc()))


process_zip_file(sys.argv[1], sys.argv[2], int(sys.argv[3]))

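Pretty simple. We open the zip file and one or two CSV files within it.

What's weird is that it fails toward the end of processing if I run it with a large zip file (~13 MB), have it instantiate the ZipFile from a StringIO.StringIO or an io.BytesIO, and have it open two CSV files rather than just one. (Perhaps this applies to anything other than a plain filename? I had similar problems in the Django app when trying to create a ZipFile from a TemporaryUploadedFile, or even from a file object created by calling os.tmpfile() and shutil.copyfileobj().) Here is the output I see on a Linux system: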

$ ./test_zip_file.py ~/data.zip direct 1
Processed 250 items.

$ ./test_zip_file.py ~/data.zip StringIO 1
Processing failed on item 242

Traceback (most recent call last):
  File "./test_zip_file.py", line 26, in process_zip_file
    for idx, row in enumerate(csv_file):
  File ".../python2.7/csv.py", line 104, in next
    row = self.reader.next()
  File ".../python2.7/zipfile.py", line 523, in readline
    return io.BufferedIOBase.readline(self, limit)
  File ".../python2.7/zipfile.py", line 561, in peek
    chunk = self.read(n)
  File ".../python2.7/zipfile.py", line 581, in read
    data = self.read1(n - len(buf))
  File ".../python2.7/zipfile.py", line 641, in read1
    self._update_crc(data, eof=eof)
  File ".../python2.7/zipfile.py", line 596, in _update_crc
    raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'

$ ./test_zip_file.py ~/data.zip BytesIO 1
Processing failed on item 242

Traceback (most recent call last):
  File "./test_zip_file.py", line 26, in process_zip_file
    for idx, row in enumerate(csv_file):
  File ".../python2.7/csv.py", line 104, in next
    row = self.reader.next()
  File ".../python2.7/zipfile.py", line 523, in readline
    return io.BufferedIOBase.readline(self, limit)
  File ".../python2.7/zipfile.py", line 561, in peek
    chunk = self.read(n)
  File ".../python2.7/zipfile.py", line 581, in read
    data = self.read1(n - len(buf))
  File ".../python2.7/zipfile.py", line 641, in read1
    self._update_crc(data, eof=eof)
  File ".../python2.7/zipfile.py", line 596, in _update_crc
    raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'

$ ./test_zip_file.py ~/data.zip StringIO 0
Processed 250 items.

$ ./test_zip_file.py ~/data.zip BytesIO 0
Processed 250 items.

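Incidentally, the code fails under the same conditions on my OS X system, but in a different way. Instead of raising the BadZipfile exception, it seems to read corrupted data and get very confused.

This all suggests to me that I am doing something in this code that you are not supposed to do, e.g. calling ZipFile.open on one file while another file within the same zip file object is already open. This doesn't appear to be a problem when using ZipFile(filename), but perhaps it is problematic when passing ZipFile a file-like object, because of some implementation detail in the zipfile module?

Perhaps I missed something in the zipfile docs? Or maybe it isn't documented yet? Or (least likely) is it a bug in the zipfile module?

I may have just found the problem and a solution, but unfortunately I had to replace Python's zipfile module with a hacked copy of my own (called myzipfile here).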

$ diff -u ~/run/lib/python2.7/zipfile.py myzipfile.py
--- /home/msabramo/run/lib/python2.7/zipfile.py 2010-12-22 17:02:34.000000000 -0800
+++ myzipfile.py        2011-04-11 11:51:59.000000000 -0700
@@ -5,6 +5,7 @@
 import binascii, cStringIO, stat
 import io
 import re
+import copy

 try:
     import zlib # We may need its compression method
@@ -877,7 +878,7 @@
         # Only open a new file for instances where we were not
         # given a file object in the constructor
         if self._filePassed:
-            zef_file = self.fp
+            zef_file = copy.copy(self.fp)
         else:
             zef_file = open(self.filename, 'rb')

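The problem in the standard zipfile module is that when it is passed a file object (not a filename), it uses that same passed-in file object for every call to the open method. This means that tell and seek are called on the same file, so the files opened within the zip file all share one file position and multiple open calls end up stepping all over each other. In contrast, when passed a filename, open opens a new file object each time. My solution covers the case where a file object is passed in: instead of using that file object directly, I create a copy of it.

This change to zipfile fixes the problems I was seeing: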
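The shared-position effect can be illustrated without zipfile at all. The following is only a sketch of the mechanism, using a plain io.BytesIO and two made-up helper functions (read_with_shared_object and read_with_separate_objects are hypothetical names, not part of any library). It does not reproduce the BadZipfile error itself; it just shows how one reader's seek changes what the other reader gets next.

```python
import io

DATA = b"AAAABBBBCCCCDDDD"


def read_with_shared_object():
    # Both logical readers use the same file object, so they share one position.
    shared = io.BytesIO(DATA)
    first = shared.read(4)     # reader 1 reads the first 4 bytes; position is now 4
    shared.seek(12)            # reader 2 seeks to its own data
    other = shared.read(4)     # reader 2 reads 4 bytes; position is now 16
    resumed = shared.read(4)   # reader 1 resumes at position 16, not 4
    return first, other, resumed


def read_with_separate_objects():
    # Each logical reader has its own file object and therefore its own position.
    reader1 = io.BytesIO(DATA)
    reader2 = io.BytesIO(DATA)
    first = reader1.read(4)    # reader 1 reads the first 4 bytes
    reader2.seek(12)           # reader 2 seeks without affecting reader 1
    other = reader2.read(4)
    resumed = reader1.read(4)  # reader 1 continues from position 4
    return first, other, resumed


print(read_with_shared_object())
print(read_with_separate_objects())
```

In the shared case, reader 1 resumes at the end of the data and gets nothing back instead of the BBBB block it expected. A decompressor fed from the wrong position in this way would plausibly produce output that fails a CRC check.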

$ ./test_zip_file.py ~/data.zip StringIO 1
Processed 250 items.

$ ./test_zip_file.py ~/data.zip BytesIO 1
Processed 250 items.

$ ./test_zip_file.py ~/data.zip direct 1
Processed 250 items.

But I don't know whether it has other negative effects on zipfile...

EDIT: I just found a mention of this in the Python docs that I had somehow overlooked before. At http://docs.python.org/library/zipfile.html#zipfile.ZipFile.open, it says:

Note: If the ZipFile was created by passing in a file-like object as the first argument to the
constructor, then the object returned by open() shares the ZipFile’s file pointer. Under these
circumstances, the object returned by open() should not be used after any additional operations
are performed on the ZipFile object. If the ZipFile was created by passing in a string (the
filename) as the first argument to the constructor, then open() will create a new file object that will be held by the ZipExtFile, allowing it to operate independently of the ZipFile.
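Given that note, one way to avoid patching zipfile might be to give each member that is open at the same time its own ZipFile, built over its own in-memory copy of the data. The sketch below assumes the whole archive fits in memory (as in the test script above). The helper names build_sample_zip and count_items are hypothetical, and the tiny archive is made up for illustration.

```python
import io
import zipfile


def build_sample_zip():
    # Build a small in-memory archive with the two member names used above.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("items.csv", "image1\na.png\nb.png\n")
        zf.writestr("defaults.csv", "key,value\ncolor,red\n")
    return buf.getvalue()


def count_items(zip_bytes):
    # Two ZipFile objects, each over its own BytesIO, so members opened
    # from one never share a file position with members of the other.
    items_zip = zipfile.ZipFile(io.BytesIO(zip_bytes))
    defaults_zip = zipfile.ZipFile(io.BytesIO(zip_bytes))
    try:
        items = items_zip.open("items.csv")
        line_count = 0
        for _line in items:
            # Opening defaults.csv here goes through defaults_zip only.
            defaults = defaults_zip.open("defaults.csv")
            defaults.close()
            line_count += 1
        items.close()
    finally:
        items_zip.close()
        defaults_zip.close()
    return line_count - 1  # do not count the header line


print(count_items(build_sample_zip()))
```

The cost is that the archive bytes are wrapped twice and the central directory is parsed twice. An alternative that fits the documentation note equally well is to read defaults.csv completely before starting to iterate over items.csv, so that only one member is in use at any time.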
