Random Musings: Python boto aws s3 multipart uploads

Did you know that there are two ways to upload data into Amazon AWS S3? Single Part and Multi Part uploads. As the names suggest, the first method uploads the whole file in one shot. The second method takes a collection of files that are sequences, uploads them separately, then when signalled that the upload is complete, assembles them into a single file. This has a few advantages for larger files. One can resume a failed upload by looking at the parts that are present and uploading the rest, or one can upload the parts in parallel, allowing the upload to complete faster.

What is not obvious or clear is that if a multi-part upload fails in the middle, is leaves a partially finished upload file (called a key) in S3 with a bunch of unassembled parts associated with it. This makes sense as you could not resume a failed upload if the failed upload were automatically deleted. However, the unfinished upload is not visible through the AWS console or the standard methods for listing files in S3 and you are charged for them, so it is useful to periodically look for them and remove them.

I discovered this while learning how to use boto in Python to interface with AWS, and much to my surprised discovered a number of failed 50GB uploads that were orphaned and costing me money. I wrote a simple script (my first python script) to allow the enumeration and removal of the orphans.

'''
ManageS3MultipartOrphans.py

This is a simple script to manage orphaned multipart uploads.  It has two
function definitions that can list or remove orphans in all buckets or a specified
bucket.  Check your buckets, I was suprised how many orphans I found.

Usage:
lists3orphans ('')
 Lists all buckets and any orphan uploads including their age in minutes.
 Does not show the individual parts or the total size of the orphan (though
 this would be fairly trivial to add is it were of interest.
 
lists3orphans ('bucketname')
 same as above but for a single bucket
 
removes3orphans ('', age)
 Same as lists3orphans but removes any orphans older then 'age' in minutes.
 
removes3orphans ('bucketname', age)
 Same as removes3orphans but removes any orphans for the specified bucket older then 'age' in minutes.
 
Note:
 The connection is made with the calling_format=OrdinaryCallingFormat() directive, allowing
 the enumeration of buckets with mixed case names which are not recommended by AWS and will
 raise errors in BOTO with the default connection, yet you can create them through the AWS console.
'''

import boto
from boto.s3.key import Key
from boto.s3.connection import OrdinaryCallingFormat
import time

def lists3singleorphans (bucket):
 from boto.s3.key import Key
 b = conn.get_bucket( bucket )
 if b is None: return
 print 'Checking: ' + b.name
 lmp = b.list_multipart_uploads()
 for mp in lmp:
  key = b.get_key ( mp.key_name )
  if key is None: return
  print key.name
  last_modified = ''
  for part in mp:
   if last_modified == '': last_modified = part.last_modified
   if last_modified <= part.last_modified: last_modified = part.last_modified
  print last_modified
  if last_modified != '': 
   print bucket + ': ' + key.name + ' ' + str ((time.time() - time.mktime(time.strptime(last_modified, '%Y-%m-%dT%H:%M:%S.000Z')))/60) + ' minutes old'

def removes3singleorphans (bucket, age):
 from boto.s3.key import Key
 b = conn.get_bucket( bucket )
 if b is None: return
 print 'Checking: ' + b.name
 lmp = b.list_multipart_uploads()
 if lmp is None: return
 for mp in lmp:
  key = b.get_key ( mp.key_name )
  if key is None: return
  print key.name
  last_modified = ''
  for part in mp:
   if last_modified == '': last_modified = part.last_modified
   if last_modified <= part.last_modified: last_modified = part.last_modified
  ageinminutes = (time.time() - time.mktime(time.strptime(last_modified, '%Y-%m-%dT%H:%M:%S.000Z')))/60
  if ageinminutes > age:
     print bucket + ': ' + key.name + ' ' + str (ageinminutes) + ' minutes old'
     print 'removing'
     key.delete()

def lists3orphans (bucket):
 if bucket != '':
  lists3singleorphans( bucket )
 else:
  b = conn.get_all_buckets( )
  for bucket in b:
   lists3singleorphans( bucket.name )
  
def removes3orphans (bucket, age):
 if bucket != '':
  removes3singleorphans( bucket, age )
 else:
  b = conn.get_all_buckets( )
  for bucket in b:
   removes3singleorphans( bucket.name, age )

bucketname = ''  
age = 1440

conn = boto.connect_s3(calling_format=OrdinaryCallingFormat())

# lists3orphans( bucketname )
# or
#removes3orphans ( bucketname, age )

Random Musings

Friday, December 2, 2011

Python boto aws s3 multipart uploads

No comments:

Post a Comment