Never allow users to transfer files with non-Latin characters in the name

I received a private message on the Yii forum asking whether it is possible to transfer (via HTTP upload) files with non-English characters in the filename. In this particular case the filenames were Greek and Chinese (and English), but with the option to support every language that exists.

Conclusion: No f*ing way! But if you wish to know more about this, read on.

Let me go through this topic in question-and-answer format.

The upload succeeds only with English characters.
If I use:
$name_file = iconv('UTF-8', 'greek//TRANSLIT', $model->file->name);
$model->file->saveAs('documents/' . $name_file);
English and Greek names succeed, but names with Chinese characters do not.

I don’t see anything wrong here. In the first line you told PHP to convert the filename from UTF-8 to a Greek character set (iconv takes the source encoding first and the target encoding second), so it could properly handle only English filenames (no special characters) and filenames with Greek characters.

There is no way this could handle Chinese or any other special characters. I would be very surprised if it did.

If you used the same line, but passed a Chinese character set as the second parameter to iconv, your script would handle English and Chinese characters without any problems, but would fail on Greek and everything else. There seems to be some kind of logic in this.
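To make the direction of the conversion concrete, here is a minimal sketch. The helper name fitsCharset is hypothetical (it is not part of the original question), and the sketch assumes that the iconv extension is available and that the underlying iconv library knows the charset name ISO-8859-7, a common single-byte Greek encoding. It checks whether a UTF-8 filename survives a round trip through a given target charset:

```php
<?php
// Hypothetical helper: returns true only if every character of the UTF-8
// string $name can be represented in $charset.
function fitsCharset(string $name, string $charset): bool
{
    // iconv's first argument is the source encoding, the second is the target.
    $encoded = @iconv('UTF-8', $charset, $name);
    if ($encoded === false) {
        return false; // at least one character has no equivalent in $charset
    }
    // Some iconv implementations substitute a placeholder instead of failing,
    // so convert back and compare with the original string.
    $decoded = @iconv($charset, 'UTF-8', $encoded);
    return $decoded === $name;
}

var_dump(fitsCharset('report.txt', 'ISO-8859-7')); // plain English name
var_dump(fitsCharset('αρχείο.txt', 'ISO-8859-7')); // Greek name
var_dump(fitsCharset('文件.txt', 'ISO-8859-7'));    // Chinese name
```

A single-byte Greek charset has room for the English and Greek letters, but not for Chinese ones, which is exactly the behaviour described in the question.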

How to make it for all languages?

First of all, storing files on a server under names that contain anything other than plain English characters is complete madness!

Your PHP and Apache support UTF-8, but your file system most likely does not! You’ll end up with doubled files, files with incorrect filenames, files that cannot be downloaded, etc. You’re asking for real trouble.

Even if you can make sure that your server’s (Linux?) file system is 100% UTF-8 ready and can write UTF-8 encoded filenames, the HTTP upload itself will cause another bunch of troubles (a piece of which you have already tasted) if you attempt to transfer files with non-English characters in their names.

If you would like to support all languages with the above (iconv) method, you would have to:

  • find a way to determine in which language or alphabet the filename is written (is that possible at all?),
  • transfer this language setting along with the transmitted file,
  • set the second parameter of iconv according to the transmitted value of the detected language.

This is madness, let me underline this again!

For example, “pilot” is a valid word in English, Polish and probably many other languages. The same goes for “stop”. You can name hundreds of such examples. How are you going to detect the language of a filename correctly in this case? Take some time and test Google Translate with the language auto-detection option enabled to see how often it makes mistakes.

You can ask the user to select the language of the filename in the same form he uses to upload the file. But what if he makes the wrong selection? This is even bigger madness.

My advice: the only reasonable solution here is to store filenames in English only and to abort the file transfer if you detect non-English (non-Latin, actually) characters in the filename.
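A minimal sketch of such a check, assuming a plain PHP upload handler. The function name isSafeFilename and the allowed character set (ASCII letters, digits, dot, underscore and hyphen) are assumptions made for illustration, so adjust the whitelist to whatever your application considers acceptable:

```php
<?php
// Hypothetical helper: accept only names built from ASCII letters, digits,
// dots, underscores and hyphens; everything else is rejected.
function isSafeFilename(string $name): bool
{
    // \A and \z anchor the whole string, so a trailing newline cannot slip through.
    if (preg_match('/\A[A-Za-z0-9._-]+\z/', $name) !== 1) {
        return false;
    }
    // Also refuse names made of dots only, such as "." or "..".
    return trim($name, '.') !== '';
}

// Usage sketch: stop the transfer before the file is saved.
// $uploadedName stands in for whatever name the browser sent.
$uploadedName = 'αρχείο.txt';
if (!isSafeFilename($uploadedName)) {
    echo "Upload rejected: use only Latin letters, digits, '.', '_' and '-' in the filename.\n";
}
```

A whitelist is used here on purpose: listing the few characters you accept is much easier than listing everything that could go wrong.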

If you really need to support non-Latin characters in filenames, consider the following solutions:

  • kill the project manager who told you to code this,
  • tell the customer that implementing this will cost a hundred million dollars, and hire someone to write a new Linux and a new PHP for you.

But seriously… The only thing that comes to my mind is letting users transfer files directly via FTP (no HTTP file upload), assuming that the FTP protocol itself will let you transfer files with non-Latin characters in filenames, and then somehow binding the files transferred this way to your application.

This is also madness. Take some time and test Total Commander (which has quite a good FTP client on board) to see how many times it goes wacko when you try to upload or download a file with a non-Latin filename.

I had so many problems with the simple French “e” (they’ve got five different ones there, with and without accents pointing left, right, etc.), which made my server go completely wacko, generate two separate files (one with an accented “e” and one with an unaccented “e”) and do a lot more stupid things.

Thus, I said to myself then: no f*ing way! Hell will freeze over before I let users upload files with non-Latin characters in filenames.
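One possible mechanism behind this kind of mess (an assumption, as I cannot say for sure that this is what happened on that server) is that Unicode allows the same accented letter to be written as two different byte sequences: a single precomposed character, or a plain letter followed by a combining accent. Anything that compares names byte by byte sees two different files. A small sketch using only built-in string functions:

```php
<?php
// The same visible name "é.txt", spelled two ways in UTF-8.
$precomposed = "\u{00E9}.txt";  // U+00E9, a single code point for "é"
$decomposed  = "e\u{0301}.txt"; // "e" followed by U+0301 COMBINING ACUTE ACCENT

var_dump($precomposed === $decomposed); // bool(false): the bytes differ
var_dump(strlen($precomposed));         // int(6)
var_dump(strlen($decomposed));          // int(7)
```

If the intl extension is installed, Normalizer::normalize() can bring both spellings to one form, but that removes only this one source of confusion.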
