Replace all sound-like expressions in subtitles [updated!]
This article covers manual fixes to subtitles. You may consider some automatic tools instead.
I’m as big a fan (maniac) of perfectly crafted movie subtitles as I am a regular expression newbie (ignoramus). I simply don’t understand them, and I’m pretty scared of every attempt at, or need for, using them. Mostly because most of such attempts fail! :]
Until today, my biggest problem with movie subtitles was “sound-like” expressions. Manual removal of 100+ lines out of 400+ subtitle files wasn’t an option. Today, I said to myself that I’m going to sit at the computer until I figure out a regular expression that I can feed into Notepad++ to strip all of this junk-like (at least to me) text out of each of my movie subtitles.
So I did. Finding a proper regular expression for such an extremely easy task is a snap of the fingers for every regular expression freak. For a regexp ignoramus like me, it took no more than five minutes, so I managed to get back home in time for dinner.
Sound-like expressions
If you don’t know what these are, let me show you some examples:
[MEN SPEAKlNG lN FORElGN LANGUAGE]
[ALL GRUNTlNG]
[MEN SHOUTlNG]
[LAUGHS]
Usually you can find such “elements” in subtitles prepared for hearing-impaired viewers. Since my hearing is still quite good, I find such “additions” simply distracting.
They have two common problems. They are:
- stupid-looking (at least to me): I can hear when someone is laughing without a text telling me so,
- full of mistakes: notice all these l (small letter “l”) in place of I (capital “i”) etc.
The latter might be an effect of running poor OCR software on the graphical subtitles stored on DVDs, as the DVD specification allows only graphical subtitles. This was necessary to display ideograph-like letters (Japanese, Chinese, Korean etc.), since text encodings able to handle them, such as UTF-8, were not in common use when the DVD specification was created.
As I said in the introduction, manual removal of these lines was not an option, and I had to employ regular expressions to get rid of them once and for all.
Please note (as given at the end of this article) that all regular expressions shown here are tested and used by me in Notepad++. These are 100% certified to work in this editor (since I use them on a nearly daily basis), but you may find them not working in your software or system.
Basic regular expression
To cut a long story short, the proper regular expression for this task is as simple as:
\[(.)*\]
i.e. match:
- opening square bracket plus
- any number of any character except newline plus
- closing square bracket.
If your subtitle file contains sound-like expressions in round brackets (parentheses), you should use this instead:
\((.)*\)
I figured these out using the Regular Expressions Cheat Sheet and tested them in Regex Tester.
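Both patterns can also be tried outside Notepad++. Below is a minimal sketch in Python, assuming its re module treats these simple patterns the same way Notepad++ does. The helper name strip_sounds, the sample lines and the NARROW variant are made up for illustration and are not part of the original recipe.

```python
import re

# The two patterns from the article, unchanged.
SQUARE = re.compile(r"\[(.)*\]")
ROUND = re.compile(r"\((.)*\)")

# Hypothetical stricter variant: stop at the first closing bracket.
NARROW = re.compile(r"\[[^\]]*\]")


def strip_sounds(line: str) -> str:
    """Replace bracketed sound descriptions with a single space."""
    line = SQUARE.sub(" ", line)
    line = ROUND.sub(" ", line)
    return line


print(repr(strip_sounds("[LAUGHS]")))
print(repr(strip_sounds("Hello (sighs)")))

# The * is greedy, so everything between the first "[" and the last "]"
# on a line is removed, including the dialogue in between.
two_sounds = "[door closes] Come in. [door opens]"
print(repr(strip_sounds(two_sounds)))
print(repr(NARROW.sub(" ", two_sounds)))
```

The last two lines show the one case worth watching for: when a single line holds two bracketed expressions, the basic pattern also swallows the dialogue between them, while the stricter variant keeps it.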
More regular expressions
The above is often just too simple to catch all the sound-like expressions in subtitles.
Here are some modifications to the above expression:
- \-\ \[(.)*\] matches expressions like “- [door closes]” (yeah, some idiots do that!),
- \-\ \[(.)*\]\x20 matches expressions like the above with an additional space at the end, to capture sound-like expressions that are part of a longer sentence, e.g. “- [Pauline] Not for a while. A month.”,
- \[(.)*\]\x20 does the same as above, but for sentences that do not start with “- ”,
- \x20\[(.)*\] matches something like “- For a smoke! [coughing]”.
The creativity of subtitle authors goes beyond my imagination, so I may need to update this article again.
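To see how these variants behave together, here is a small Python sketch. It assumes the patterns are applied one after another, most specific first (the article does not prescribe an order), and that each match is replaced with a single space. The function name clean and the sample lines are illustrative only.

```python
import re

# Patterns from the list above, most specific first, so that the leading
# dash and the trailing space are consumed together with the brackets.
PATTERNS = [
    r"\-\ \[(.)*\]\x20",
    r"\-\ \[(.)*\]",
    r"\[(.)*\]\x20",
    r"\x20\[(.)*\]",
]


def clean(line: str, replacement: str = " ") -> str:
    """Apply every pattern in turn to a single subtitle line."""
    for pattern in PATTERNS:
        line = re.sub(pattern, replacement, line)
    return line


for sample in [
    "- [door closes]",
    "- [Pauline] Not for a while. A month.",
    "[sighs] Fine.",
    "- For a smoke! [coughing]",
]:
    print(repr(sample), "->", repr(clean(sample)))
```

Note that the first pattern removes the leading dash along with the speaker tag, so a dialogue line that started with “- [Pauline]” no longer starts with a dash afterwards.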
Roles
Another pain in my ass is unnecessary information about who is speaking, e.g.:
TANGO:
Come on, Cash!
GUARD:
Let's go, you're late!
These are, again, mostly meant for hearing-impaired viewers, and again… I find them distracting and thus try to get rid of them in my subtitle files.
The quickest way is to search for:
(.)*:\r\n
and replace it with an empty string.
Next, we have to deal with roles given on the same line as the dialogue, like this:
TANGO: Know where you're going?
CASH: Absolutely!
This time we use:
(.)*:\x20
and again replace it with an empty string.
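The two role patterns can be checked with a short Python sketch. It assumes Windows line endings (\r\n), as the first pattern does. STRICT_ROLE is a hypothetical, more cautious variant that is not part of the original recipe: it only accepts upper-case names at the start of a line, so ordinary dialogue containing a colon survives.

```python
import re

# Patterns from the article: a role on its own line, and a role that
# shares the line with the dialogue.
ROLE_OWN_LINE = re.compile(r"(.)*:\r\n")
ROLE_INLINE = re.compile(r"(.)*:\x20")

# Hypothetical stricter variant: an upper-case name at the start of a line,
# followed by a colon and then either a space or a line break.
STRICT_ROLE = re.compile(r"^[A-Z][A-Z0-9 .'\-]*:(\x20|\r?\n)", re.MULTILINE)

sample = (
    "TANGO:\r\n"
    "Come on, Cash!\r\n"
    "CASH: Absolutely!\r\n"
)

# Both article patterns, applied one after the other.
cleaned = ROLE_INLINE.sub("", ROLE_OWN_LINE.sub("", sample))
print(repr(cleaned))

# A colon inside ordinary dialogue is also treated as a role label
# by the inline pattern, but not by the stricter one.
dialogue = "Remember: be quiet.\r\n"
print(repr(ROLE_INLINE.sub("", dialogue)))
print(repr(STRICT_ROLE.sub("", dialogue)))
```

The timestamp lines of an .srt file are safe with both article patterns, because their colons are never followed by a space or a line break.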
Poor OCR
Because the people who created the original DVD format are from Mars, they figured it would be so cool if subtitles were stored on the disc as… graphical files (bitmaps), not as text. As a result, anyone ripping an original DVD must run OCR (optical character recognition) on the subtitles, the same process you use when turning a scanned piece of paper into an editable document.
This OCR process is usually done in a quick mode or is based on a weak algorithm, which causes many l letters (small letter “l”) to appear in place of the I letter (capital “i”).
We can fix this the same way as the problems above, only this time we don’t need regular expressions. A simple find-and-replace will do the trick. All you need to know is what to look for.
Some (but naturally not all) examples:
- l'm → I'm
- l am → I am
- l've → I've
- l ha → I ha (to match both have and has)
- “ ls” → “ Is” (with a space at the beginning; to match Is and Isn’t, but not calls etc.)
- “ lt” → “ It” (same situation as above)
- lt's → It's (to match It’s at the beginning of lines)
- ln → In
- “ l ” → “ I ” (notice the space at the beginning and at the end)
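For scripted use, the same list can be expressed as a small Python table. This is a sketch that swaps the plain find-and-replace for regular expressions with word boundaries (\b); that substitution is my assumption, made so that a pair such as “l ha” cannot touch the end of “will have”. The names OCR_FIXES and fix_ocr are illustrative.

```python
import re

# (pattern, replacement) pairs based on the list above. \b marks a word
# boundary, standing in for the leading and trailing spaces in the list.
OCR_FIXES = [
    (r"\bl'm", "I'm"),
    (r"\bl am\b", "I am"),
    (r"\bl've", "I've"),
    (r"\bl ha", "I ha"),
    (r"\bls", "Is"),
    (r"\blt", "It"),
    (r"\bln", "In"),
    (r"\bl\b", "I"),
]


def fix_ocr(text: str) -> str:
    """Apply every OCR correction in turn."""
    for pattern, replacement in OCR_FIXES:
        text = re.sub(pattern, replacement, text)
    return text


line = "l'm sure l have it. ls it? lt's late. ln here, l think we will have all."
print(fix_ocr(line))

# For comparison: a plain replace of "l ha" also hits the end of "will".
print("We will have time.".replace("l ha", "I ha"))
```

The second print shows why the boundaries matter: without them, an ordinary word ending in “l” followed by “have” gets a capital “I” glued onto it.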
Some other things to fix
Sometimes subtitle authors forget that there must be a space after the dash that opens a line of dialogue:
234
00:19:50,816 --> 00:19:53,735
-Are you OK?
-I need to buy a battery.
235
00:19:53,944 --> 00:19:56,738
-Can we clear some space?
-Everything'll be OK.
236
00:19:56,947 --> 00:20:00,951
-We'll soon have you in the warm.
This is also very easy to fix:
- Replace “-” (single dash) with “- ” (dash and a space).
- Replace “ - - > ” (notice the opening and closing spaces) with “ --> ”.
The first replace breaks the “time arrows”, an essential element of .srt files. Forgetting to repair them (the second replace) makes your subtitle file invalid. It won’t be readable by most players.
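If you script this instead of using the Replace dialog, the time arrows do not have to be broken in the first place. The Python sketch below is an alternative to the two replaces, not the method described above: it only touches a dash at the very start of a line that is directly followed by text. The names LEADING_DASH and space_after_dash are made up for this example.

```python
import re

# A dash at the very start of a line, directly followed by a character
# that is neither whitespace nor another dash: "-Are" but not "- Are".
# The arrow in a timestamp line never sits at the start of the line.
LEADING_DASH = re.compile(r"^-(?=[^\s\-])", re.MULTILINE)


def space_after_dash(text: str) -> str:
    """Insert a space after the dash that opens a line of dialogue."""
    return LEADING_DASH.sub("- ", text)


cue = (
    "234\n"
    "00:19:50,816 --> 00:19:53,735\n"
    "-Are you OK?\n"
    "-I need to buy a battery.\n"
)
print(space_after_dash(cue))
```

Hyphens inside a line, such as the one in “well-known”, are left alone as well, because the pattern is anchored to the beginning of the line.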
How to do it in Notepad++?
The remaining part was to push this regex to Notepad++.
Here is a sample Replace dialog configuration for replacing all sound-like texts inside single subtitle file:

And here is the same dialog configured for fixing subtitles in all files at once:

Doing a batch replace on all files at once carries a certain risk, so, as you can see, I always perform such an operation on a copy of all my subtitles.
Notice that I’m replacing sound-like expressions with a single space, not with an empty string, and, most importantly, I’m not removing the entire subtitle entries containing them. This is because the .srt format numbers each subtitle with a consecutive integer, so entries can only be safely removed from the end of a file.
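The same precautions (work on a copy, replace with a single space, keep every numbered entry) can be sketched as a small Python script for those who prefer it to the Replace dialog. The function name clean_copy and the directory names are hypothetical. Reading files as latin-1 is an assumption made so that bytes in any encoding pass through unchanged.

```python
import re
import shutil
from pathlib import Path

SOUND = re.compile(r"\[(.)*\]")  # basic pattern from the article


def clean_copy(source_dir: str, copy_dir: str) -> int:
    """Copy source_dir to copy_dir, then clean every .srt file in the copy.

    Returns the number of files that were changed.
    """
    shutil.copytree(source_dir, copy_dir)  # fails if copy_dir already exists
    changed = 0
    for path in Path(copy_dir).rglob("*.srt"):
        # latin-1 maps every byte to one character, so bytes we do not
        # touch survive the round trip; newline="" keeps \r\n as it is.
        with open(path, encoding="latin-1", newline="") as handle:
            original = handle.read()
        # Replace with a single space so the numbered entry itself stays.
        cleaned = SOUND.sub(" ", original)
        if cleaned != original:
            with open(path, "w", encoding="latin-1", newline="") as handle:
                handle.write(cleaned)
            changed += 1
    return changed
```

A call such as clean_copy("subtitles", "subtitles_cleaned") leaves the originals untouched, so a bad pattern costs nothing more than deleting the copy and trying again.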
Afterword
If you wish to continue with the topic of regular expressions, then this article might be a good choice.