Replace all sound-like expressions in subtitles [updated!]

I’m as big a fan (maniac) of perfectly crafted movie subtitles as I am a regular expression newbie (ignoramus). I simply don’t understand regular expressions, and I dread every attempt or need to use them, mostly because most such attempts fail! :]

Until today, my biggest problem with movie subtitles was “sound-like” expressions. Manually removing 100+ lines from 400+ subtitle files wasn’t an option. Today I told myself that I would sit at the computer until I figured out a regular expression that I could feed into Notepad++ to strip all of this junk (junk to me, at least) from every one of my movie subtitle files.

So I did. Finding the proper regular expression for such an extremely easy task is a snap of the fingers for any regular expression freak. For a regexp ignoramus like me, it took no more than five minutes, so I still made it home in time for dinner.

Sound-like expressions

If you don’t know what these are, here are some examples:

[MEN SPEAKlNG lN FORElGN LANGUAGE]
[ALL GRUNTlNG]
[MEN SHOUTlNG]
[LAUGHS]

You can usually find such “elements” in subtitles prepared for hearing-impaired viewers. Since my hearing is still quite good, I find such “additions” simply distracting.

They have two common problems. They:

  • look stupid (at least to me): I can hear when someone is laughing without a text telling me so,
  • are full of mistakes: notice all the l characters (lowercase “L”) in place of I (capital “i”) etc.

The latter is probably the effect of running poor OCR software on the graphical subtitles stored on DVDs. The DVD specification allows only graphical (bitmap) subtitles, presumably so that ideographic scripts (Japanese, Chinese, Korean etc.) could be displayed at a time when Unicode text support was far from universal.

As I said in the introduction, removing these lines manually was not an option, so I had to enlist regular expressions to get rid of them once and for all.

Please note (as explained at the end of this article) that all regular expressions shown here were tested and used by me in Notepad++. They are 100% certified to work in this editor (I use them on a nearly daily basis), but you may find that they do not work in your software or on your system.

Basic regular expression

To cut a long story short, the proper regular expression for this task is as simple as:

\[(.)*\]

i.e. match:

  • opening square bracket plus
  • any number of any character except newline plus
  • closing square bracket.

If your subtitle file puts sound-like expressions in parentheses (round brackets), use this instead:

\((.)*\)

I figured these out using the Regular Expressions Cheat Sheet and tested them in Regex Tester.
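For anyone who would rather script this than use an editor, here is a minimal Python sketch of the same two patterns. It assumes Python's `re` module, whose regex engine is not the one Notepad++ uses, so treat it as an illustration rather than a drop-in equivalent. The names `strip_sounds` and `SQUARE_STRICT` are made up for this example. `SQUARE_STRICT` is a hypothetical stricter variant, not part of the original recipe: because `(.)*` is greedy, the basic pattern matches from the first opening bracket to the last closing bracket on a line.

```python
import re

# The two patterns from the article, unchanged.
SQUARE = re.compile(r"\[(.)*\]")
ROUND = re.compile(r"\((.)*\)")

# Hypothetical stricter variant (an assumption, not from the article):
# stop at the first closing bracket instead of the last one on the line.
SQUARE_STRICT = re.compile(r"\[[^\]\r\n]*\]")


def strip_sounds(text, pattern=SQUARE):
    """Replace every match of the pattern with a single space."""
    return pattern.sub(" ", text)


print(repr(strip_sounds("[MEN SHOUTlNG]\nCome on!")))
print(repr(strip_sounds("[SIGHS] Fine. [DOOR CLOSES]")))
print(repr(strip_sounds("[SIGHS] Fine. [DOOR CLOSES]", SQUARE_STRICT)))
```

The second call removes the whole line, dialogue included, while the third keeps "Fine." in place. Lines with two bracketed expressions around real dialogue are worth a manual check when using the basic pattern.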

More regular expressions

The above is often too simple to catch all the sound-like expressions in subtitles.

Here are some modifications of the above expression:

  • \-\ \[(.)*\] matches expressions like - [door closes] (yeah, some idiots do that!),
  • \-\ \[(.)*\]\x20 matches expressions like the above followed by a space, to capture sound-like expressions that are part of a longer sentence, such as - [Pauline] Not for a while. A month.,
  • \[(.)*\]\x20 does the same for sentences that do not start with -,
  • \x20\[(.)*\] matches something like For a smoke! [coughing].
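To see exactly which part of a line each variant grabs, here is a small Python sketch. It assumes Python's `re` module rather than the Notepad++ engine, and the names `VARIANTS` and `matched_text` are invented for this example. It only reports the matched text, since the right replacement depends on the variant.

```python
import re

# The four variants listed above, copied verbatim.
VARIANTS = {
    "dash": r"\-\ \[(.)*\]",
    "dash_trailing_space": r"\-\ \[(.)*\]\x20",
    "trailing_space": r"\[(.)*\]\x20",
    "leading_space": r"\x20\[(.)*\]",
}


def matched_text(name, line):
    """Return the part of the line that the named variant matches, or None."""
    match = re.search(VARIANTS[name], line)
    return match.group(0) if match else None


print(repr(matched_text("dash", "- [door closes]")))
print(repr(matched_text("dash_trailing_space", "- [Pauline] Not for a while. A month.")))
print(repr(matched_text("leading_space", "For a smoke! [coughing]")))
```

Note that the variants with a trailing `\x20` do not match when the bracketed text is the last thing on the line, which is why the plain variants are still needed.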

The creativity of subtitle authors goes beyond my imagination, so I may need to update this article again.

Roles

Another pain in my ass is unnecessary information about who is speaking, e.g.:

TANGO:
Come on, Cash!
GUARD:
Let's go, you're late!

Again, they’re mostly meant for hearing-impaired viewers and again… I find them distracting, so I try to get rid of them in my subtitle files.

The quickest approach is to search for:

(.)*:\r\n

And replace it with an empty string.

Next, we have to deal with roles given on the same line as the dialogue, like this:

TANGO: Know where you're going?

CASH: Absolutely!

This time we use:

(.)*:\x20

And again replace it with an empty string.
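The two role patterns can be tried out in a few lines of Python. This is a sketch that assumes Python's `re` module and Windows (`\r\n`) line endings, as the first pattern does. The function name `strip_roles` is invented for this example.

```python
import re

# The two patterns from this section, unchanged.
ROLE_OWN_LINE = re.compile(r"(.)*:\r\n")  # role alone on its line
ROLE_INLINE = re.compile(r"(.)*:\x20")    # role followed by dialogue


def strip_roles(text):
    """Remove speaker labels, replacing each match with an empty string."""
    text = ROLE_OWN_LINE.sub("", text)
    return ROLE_INLINE.sub("", text)


sample = "GUARD:\r\nLet's go, you're late!\r\nTANGO: Know where you're going?\r\n"
print(repr(strip_roles(sample)))

# Timing lines survive because their colons are followed by digits.
print(repr(strip_roles("00:00:01,000 --> 00:00:02,000\r\n")))

# Caveat: any colon followed by a space counts as a role.
print(repr(strip_roles("He said: run!")))
```

The last call shows the price of such a short pattern: ordinary dialogue containing a colon and a space loses everything before the colon, so a quick look through the matches before replacing is a good idea.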

Poor OCR

Because the people who created the original DVD format are apparently from Mars, they figured it would be so cool to store subtitles on the disc as… graphical files (bitmaps), not as text. As a result, anyone ripping an original DVD must run OCR (optical character recognition) on the subtitles, the same process you use to turn a scanned sheet of paper into an editable document.

This OCR process is usually done in a quick mode or with a weak algorithm, so many l letters (lowercase “L”) end up in place of the letter I (capital “i”).

We can fix this the same way as the problems above, only this time we don’t need regular expressions. A simple find-and-replace will do the trick. All you need to know is what to look for.

Some (but naturally not all) examples:

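  • l'm → I'm
  • l am → I am
  • l've → I've
  • l ha → I ha (to match both have and has)
  • “ ls” → “ Is” (with a leading space; matches Is and Isn’t, but not calls etc.)
  • “ lt” → “ It” (same situation as above)
  • “ ln” → “ In”
  • “ l ” → “ I ” (notice the leading and trailing space)

Be aware that l am and l ha have no leading space, so they also match inside phrases such as will have. Review those matches instead of replacing them blindly.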
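The same replacements can be scripted. The sketch below uses plain `str.replace` in Python with the pairs from the list above, and adds a regex-based alternative, `fix_lone_l`, which is an assumption of this example and not part of the original recipe: it treats any "l" that stands alone as a word as a capital "I". The function names are invented for the example.

```python
import re

# Plain find-and-replace pairs, copied from the list above.
PLAIN_PAIRS = [
    ("l'm", "I'm"),
    ("l am", "I am"),
    ("l've", "I've"),
    ("l ha", "I ha"),
    (" ls", " Is"),
    (" lt", " It"),
    (" ln", " In"),
    (" l ", " I "),
]


def fix_ocr_plain(text):
    """Apply every plain replacement in order."""
    for wrong, right in PLAIN_PAIRS:
        text = text.replace(wrong, right)
    return text


# Hypothetical alternative: a lowercase "l" with word boundaries on both sides.
LONE_L = re.compile(r"\bl\b")


def fix_lone_l(text):
    """Turn a standalone 'l' into 'I' without touching words like 'will'."""
    return LONE_L.sub("I", text)


print(fix_ocr_plain("l'm here. ls it late? l've seen it, and l know."))
print(fix_ocr_plain("We will have time."))
print(fix_lone_l("We will have what l need, l'm sure."))
```

The second call shows the plain pair "l ha" damaging "will have". The word-boundary version avoids that, but it does not cover ls, lt or ln, so those still need the pairs with a leading space.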

Some other things to fix

Sometimes subtitle authors forget that there must be a space between the opening dash and the text of a dialogue line. This is also very easy to fix.

You need to replace - (a single dash) with -  (a dash and a space). The only thing to remember is that this also breaks the “time arrows”, an essential element of .srt files. You need to fix them (reverse the change) by replacing “ - - > ” (notice the leading and trailing spaces) with “ --> ”.

Otherwise your subtitle file becomes invalid and won’t be readable by most players.
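Here is how the two-step fix looks as a Python sketch, next to a regex variant that is an assumption of this example rather than part of the recipe above: it only touches a dash at the start of a line that is directly followed by a non-space character, so the time arrows are never affected. Both function names are invented for the example.

```python
import re


def add_dash_space_plain(text):
    """The two-step fix: add a space after every dash, then repair the arrows."""
    text = text.replace("-", "- ")
    return text.replace(" - - > ", " --> ")


# Hypothetical alternative: a dash at the start of a line with no space after it.
LEADING_DASH = re.compile(r"^-(?=\S)", re.MULTILINE)


def add_dash_space_regex(text):
    """Insert a space only after a line-opening dash that lacks one."""
    return LEADING_DASH.sub("- ", text)


cue = "1\n00:00:01,000 --> 00:00:02,000\n-Hello.\n-Hi.\n"
print(add_dash_space_plain(cue))
print(add_dash_space_regex(cue))

# The plain version also widens dashes that already had a space.
print(repr(add_dash_space_plain("- Hi.")))
```

Both functions give the same result on the sample cue. The difference shows on lines that were already correct, and on hyphenated words in the middle of a sentence, which the plain version changes as well.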

How to do it in Notepad++?

The remaining part was to feed this regex into Notepad++.

Here is a sample Replace dialog configuration for replacing all sound-like texts in a single subtitle file:

(Screenshot: NPP Replace all sounds)

And here is the same dialog configured for fixing subtitles in all files at once:

(Screenshot: NPP Replace all sounds in all files)

Doing a batch replace on all files at once carries some risk, so (as you can see) I always perform such an operation on a copy of all my subtitles.
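The same "work on a copy" habit can be sketched in Python for anyone who prefers a script to the Find in Files dialog. Everything here is an assumption of the example: the function name `clean_copy`, the `.srt` extension filter, and the choice to work on raw bytes so that the file encoding does not need to be guessed (this works for ASCII-compatible encodings, not for UTF-16).

```python
import re
import shutil
from pathlib import Path

# The basic sound pattern from this article, as a bytes pattern.
SOUND = re.compile(rb"\[(.)*\]")


def clean_copy(src_dir, dst_dir):
    """Copy src_dir to dst_dir, then clean every .srt file in the copy.

    Returns the number of files that were changed. The originals are never
    modified. shutil.copytree raises an error if dst_dir already exists.
    """
    shutil.copytree(src_dir, dst_dir)
    changed = 0
    for path in Path(dst_dir).rglob("*.srt"):
        original = path.read_bytes()
        cleaned = SOUND.sub(b" ", original)  # single space, as in the dialog
        if cleaned != original:
            path.write_bytes(cleaned)
            changed += 1
    return changed
```

A call such as `clean_copy("subtitles", "subtitles_cleaned")` (both folder names are placeholders) leaves the first folder untouched and reports how many files in the second one were modified.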

Notice that I’m replacing sound-like expressions with a single space, not with an empty line, and (most importantly) I’m not removing the entire subtitle entries that contain them. This is because the .srt format numbers its subtitles with consecutive integers, so only entries at the end of a file can be removed without breaking the numbering.

Afterword

If you wish to explore the topic of regular expressions further, this article might be a good choice.
