Sanitize filename

zuki · June 24, 2020, 5:19pm

Hello, please help.

I have a folder that contains multiple pdf files. I am trying to create code that will help me sanitize these filenames using a whitelist. Can anyone help? This is the code I have thus far:

import string
import unicodedata
import os



valid_filename_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
char_limit = 255
os.chdir('dir')


def clean_filename(filename, whitelist=valid_filename_chars, replace=' '):

    for r in replace:
        filename = filename.replace(r,'_')
    
    cleaned_filename = unicodedata.normalize('NFKD', filename).encode('ASCII', 'ignore').decode()

    cleaned_filename = ''.join(c for c in cleaned_filename if c in whitelist)
    if len(cleaned_filename) > char_limit:
        print("Warning, filename truncated because it was over {}. Filenames may no longer be unique".format(char_limit))

    return cleaned_filename[:char_limit]


# Print the sanitized name of every file in the current folder
for filename in os.listdir('.'):
    print(clean_filename(filename))

Theodox · June 26, 2020, 4:36pm

First, a couple of questions:

Looks like you’re expecting unicode filenames?
And I’m guessing this is py3?
So you want spaces to become underscores and everything else not in your whitelist to go away?

A lot of people would do this with a regex, which is clunky in some ways but good for this kind of thing. However, it’s often good to write the code yourself for practice. In this case I guess your ASCII normalization line will eliminate things like accents or cedillas, though I’m not sure you’ll always get back what you’re expecting, thanks to the magic of Unicode. But if you do that first, you only have to account for a blacklist of unwanted ASCII punctuation, which should be shorter than all of a-zA-Z0-9 plus your literals.
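For comparison, the regex approach mentioned above could look something like the sketch below. The name clean_filename_re is made up for illustration, and it assumes the same whitelist as the original post (letters, digits, and -_.() with spaces turned into underscores). Unlike the original, it normalizes to ASCII before replacing spaces, so treat it as a rough equivalent rather than a drop-in replacement.

```python
import re
import unicodedata

# Anything that is NOT a letter, digit, or one of - _ . ( ) gets removed.
# (Assumed whitelist, copied from the original post minus the space.)
_UNWANTED = re.compile(r"[^A-Za-z0-9\-_.()]")


def clean_filename_re(filename, char_limit=255):
    """Hypothetical regex-based variant of clean_filename."""
    # Decompose accented characters, then drop whatever is not plain ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", filename)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Spaces become underscores; everything else outside the whitelist goes away.
    ascii_name = ascii_name.replace(" ", "_")
    return _UNWANTED.sub("", ascii_name)[:char_limit]
```

With this sketch, a name such as "my report#2.pdf" would come back as "my_report2.pdf".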

To loop over the files in a folder, you would use os.listdir(), and to rename the files if necessary you’d use os.rename(). It’s a bit of extra work, but less trouble in the long run, not to rely on the current working directory. To do this in any folder, it’d look something like this:

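import os


def rename_folder_contents(folder):
    # Walk every entry in the folder and rename only those whose cleaned name differs
    for filename in os.listdir(folder):
        new_name = clean_filename(filename)
        if new_name != filename:
            # Build full paths so this works regardless of the current working directory
            full_old_path = os.path.join(folder, filename)
            full_new_path = os.path.join(folder, new_name)
            os.rename(full_old_path, full_new_path)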
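One thing to watch for: two different names can clean up to the same result, and on some platforms os.rename() will silently replace an existing file. Below is a rough sketch of a more cautious variant. The function name, the clean parameter (any function that maps a name to its cleaned form), and the dry_run flag are all made up for illustration, and the check is a simple exact-name comparison, so it will not catch clashes that differ only by case on a case-insensitive filesystem.

```python
import os


def rename_folder_contents_safe(folder, clean, dry_run=True):
    """Hypothetical cautious variant: skip renames that would clash.

    Returns two lists of (old_name, new_name) pairs: renamed and skipped.
    With dry_run=True nothing on disk is touched.
    """
    existing = set(os.listdir(folder))
    renamed, skipped = [], []
    # sorted() makes a copy, so updating 'existing' inside the loop is safe.
    for filename in sorted(existing):
        old_path = os.path.join(folder, filename)
        if not os.path.isfile(old_path):
            continue  # leave subfolders alone
        new_name = clean(filename)
        if new_name == filename:
            continue  # already clean
        if not new_name or new_name in existing:
            # Empty result or a name that is already taken: do not rename.
            skipped.append((filename, new_name))
            continue
        if not dry_run:
            os.rename(old_path, os.path.join(folder, new_name))
        # Track the planned state so later files see this name as taken.
        existing.discard(filename)
        existing.add(new_name)
        renamed.append((filename, new_name))
    return renamed, skipped
```

Calling it first with dry_run=True and printing the two lists lets you check what would happen before anything is actually renamed.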

zuki · June 26, 2020, 3:30pm

Thank you.

But after running the code, it only changes the name of one file, the last file in the folder. Could there be a reason for this?

Theodox · June 26, 2020, 6:04pm

Doh, I’m editing the code to fix the bug – see revised version above

zuki · June 26, 2020, 6:04pm

Thank you. It worked