Friday, December 2, 2011

Python boto aws s3 multipart uploads

Did you know that there are two ways to upload data into Amazon AWS S3: single-part and multipart uploads?  As the names suggest, the first method uploads the whole file in one shot.  The second method splits the data into a sequence of numbered parts, uploads them separately, and then, when signalled that the upload is complete, assembles them into a single file.  This has a few advantages for larger files.  One can resume a failed upload by looking at the parts that are already present and uploading the rest, or one can upload the parts in parallel so that the upload completes faster.
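To make the "sequence of parts" idea concrete, here is a minimal sketch of how a file could be carved into numbered byte ranges before uploading.  It is not taken from boto; the function name plan_parts and its return shape are my own illustration.  The two limits it checks are the ones S3 documents: every part except the last must be at least 5 MB, and an upload may have at most 10,000 parts.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum size for every part except the last
MAX_PARTS = 10000                # S3 maximum number of parts in one upload


def plan_parts(file_size, part_size=MIN_PART_SIZE):
    """Return (part_number, offset, length) tuples that cover the whole file.

    Hypothetical helper for illustration only; part numbers start at 1,
    which is how S3 numbers them.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum of 5 MB")

    count = -(-file_size // part_size)  # ceiling division
    if count > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")

    parts = []
    for index in range(count):
        offset = index * part_size
        # The last part carries whatever is left over and may be short.
        length = min(part_size, file_size - offset)
        parts.append((index + 1, offset, length))
    return parts
```

Each tuple can be uploaded independently, which is what makes resuming and parallel uploads possible.  Note that a 50GB file cannot use the minimum part size, since that would need more than 10,000 parts, so the part size has to grow with the file.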

What is not obvious is that if a multipart upload fails in the middle, it leaves a partially finished upload file (called a key) in S3 with a bunch of unassembled parts associated with it.  This makes sense, as you could not resume a failed upload if it were automatically deleted.  However, the unfinished upload is not visible through the AWS console or the standard methods for listing files in S3, and you are charged for the storage its parts use, so it is useful to periodically look for such uploads and remove them.
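As a rough sketch of what "look for them and remove them" can mean with boto, the two functions below walk a bucket's unfinished multipart uploads and optionally cancel them.  The function names and the dry_run flag are my own; the boto calls they rely on are Bucket.list_multipart_uploads(), the key_name, id and initiated attributes of each upload, and its cancel_upload() method.  Treat this as an illustration under those assumptions rather than a finished tool.

```python
def find_orphaned_uploads(bucket):
    """Return (key_name, upload_id, initiated) for every unfinished upload.

    `bucket` is assumed to be a boto Bucket (or anything offering the
    same list_multipart_uploads() method).
    """
    return [(upload.key_name, upload.id, upload.initiated)
            for upload in bucket.list_multipart_uploads()]


def cancel_orphaned_uploads(bucket, dry_run=True):
    """Cancel every unfinished upload and return what was (or would be) cancelled.

    With dry_run=True nothing is deleted, so the result can be reviewed first.
    """
    # Take a snapshot first so we are not cancelling uploads while still
    # paging through the listing.
    uploads = list(bucket.list_multipart_uploads())
    cancelled = []
    for upload in uploads:
        if not dry_run:
            upload.cancel_upload()  # discards the parts stored for this upload
        cancelled.append((upload.key_name, upload.id))
    return cancelled
```

With boto installed and credentials configured, a bucket object would typically come from boto.connect_s3().get_bucket(name).  Cancelling throws away every part uploaded so far, so an upload you still hope to resume should be left alone.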

I discovered this while learning how to use boto in Python to interface with AWS and, much to my surprise, found a number of failed 50GB uploads that were orphaned and costing me money.  I wrote a simple script (my first Python script) to enumerate and remove the orphans.