I'm as big a fan (maniac) of perfectly crafted movie subtitles as I am a regular expression newbie (ignorant). I simply don't understand them, and every attempt or need to use them scares me.
Until today, my biggest problem with movie subtitles was "sound-like" sentences. Manually removing 100+ lines from each of 400+ subtitle files wasn't an option. Today I told myself that I was going to sit at the computer until I figured out a regular expression that I could feed into Notepad++ to strip all of this junk-like (at least to me) text out of every one of my movie subtitles.
So I did. Finding the proper regular expression for such an extremely easy task is a snap of the fingers for any regular expression freak. Even for a regexp ignorant like me, it took no more than five minutes, so I managed to get back home in time for dinner.
Sound-like sentences
If you don't know what these are, let me show you some examples:
[MEN SPEAKlNG lN FORElGN LANGUAGE] [ALL GRUNTlNG] [MEN SHOUTlNG] [LAUGHS]
They have two problems in common. They are:
- stupid-looking (at least to me) -- I can hear when someone is laughing without a text telling me so,
- full of mistakes -- notice all these l (small letter "l") in place of I (capital "i") etc.
The latter is probably the effect of running poor OCR software on the graphical subtitles found on DVDs, as the DVD specification allows only graphical subtitles. This was necessary to display ideographic scripts (Japanese, Chinese, Korean etc.), as text encodings able to handle them, such as UTF-8, were not in common use when the DVD specification was created.
As I said in the introduction, manual removal of these lines was not an option, and I had to hire regular expressions to get rid of them once and for all.
Basic regular expression
To cut a long story short, let me tell you that the proper regular expression for this task is as simple as \[(.)*\], which reads:
- opening square bracket plus
- any number of any character except newline plus
- closing square bracket.
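The pattern can also be tried outside Notepad++. Below is a minimal Python sketch, assuming Python's re module reads this simple pattern the same way Notepad++ does; the strip_sounds name and the sample lines are made up for illustration.

```python
import re

# The basic pattern: "[", any number of any character except newline, "]".
SOUND = re.compile(r"\[(.)*\]")

def strip_sounds(line: str) -> str:
    """Replace every bracketed sound-like phrase in a line with a single space."""
    return SOUND.sub(" ", line)

samples = ["[LAUGHS]", "For a smoke! [coughing]", "No brackets here."]
for sample in samples:
    print(repr(strip_sounds(sample)))
```

One thing worth checking before a batch run: (.)* is greedy, so on a line holding two bracketed phrases it matches from the first "[" to the last "]" and swallows the dialogue in between. A lazy variant such as \[.*?\] would avoid that in Python; that variant is a suggestion, not part of the expression above.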
More regular expressions
I quickly learned that the above regexp is just too simple for all the sound-like sentences that I can find in my subtitles.
Here are some modifications to the above expression:
- \-\ \[(.)*\] -- matches sentences like "- [door closes]" (yeah, some idiots do that!),
- \-\ \[(.)*\]\x20 -- matches sentences like the above with an additional space at the end, to capture sound-like phrases that are part of a longer sentence, such as "- [Pauline] Not for a while. A month.",
- \[(.)*\]\x20 -- as above, but for situations where the sentence does not start with a dash and a space,
- \x20\[(.)*\] -- matches something like "For a smoke! [coughing]".
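To see how these variants behave on the example sentences, here is a small Python sketch. Running all four in a single pass, most specific first, is an assumption made for the sketch (in Notepad++ each expression is run separately), and the clean function name is hypothetical.

```python
import re

# The four variants from the list, ordered from most to least specific.
# Assumption: this ordering and the single-pass loop are illustrative only.
PATTERNS = [
    r"\-\ \[(.)*\]\x20",  # "- [Pauline] Not for a while. A month."
    r"\-\ \[(.)*\]",      # "- [door closes]"
    r"\[(.)*\]\x20",      # "[LAUGHS] Stop it."
    r"\x20\[(.)*\]",      # "For a smoke! [coughing]"
]

def clean(line: str) -> str:
    """Apply each variant in turn, replacing every match with a single space."""
    for pattern in PATTERNS:
        line = re.sub(pattern, " ", line)
    return line

for text in ["- [door closes]",
             "- [Pauline] Not for a while. A month.",
             "For a smoke! [coughing]"]:
    print(repr(clean(text)))
```

The order matters: if the bare-bracket variants ran first, the leading dash of "- [door closes]" would be left behind.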
The creativity of subtitle authors goes beyond my imagination, so I may need to update this article again.
The remaining part was to push this regex to Notepad++.
Here is a sample Replace dialog configuration for replacing all sound-like texts inside a single subtitle file:
And here is the same dialog configured for replacing all subtitles at once:
Doing a batch replace on all files at once carries a certain risk, so -- as you may see -- I always perform such an operation on a copy of all my subtitles.
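The same work-on-a-copy habit can be scripted. The following is only a sketch under stated assumptions: the clean_copy function and the folder names in the usage comment are hypothetical, and the files are handled as raw bytes so that the (unknown) encoding of each subtitle file does not matter.

```python
import re
import shutil
from pathlib import Path

# Bytes pattern: brackets are plain ASCII, so no decoding is needed.
SOUND = re.compile(rb"\[(.)*\]")

def clean_copy(src_dir, dst_dir) -> int:
    """Copy src_dir to dst_dir, then clean every .srt file in the copy.

    Returns the number of files that were actually changed.
    The originals in src_dir are never modified.
    """
    shutil.copytree(src_dir, dst_dir)  # raises if dst_dir already exists
    changed = 0
    for path in Path(dst_dir).rglob("*.srt"):
        data = path.read_bytes()
        cleaned = SOUND.sub(b" ", data)  # replace with a single space
        if cleaned != data:
            path.write_bytes(cleaned)
            changed += 1
    return changed

# Hypothetical usage:
# clean_copy("subtitles", "subtitles-cleaned")
```

Because copytree refuses to overwrite an existing folder, a second run cannot silently clobber an earlier cleaned copy.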
Notice that I'm replacing sound-like sentences with a single space, not with an empty line, and -- most importantly -- I'm not removing the entire subtitle entries that contain them.
This is because the .srt format I'm using numbers each and every subtitle with a consecutive integer, so subtitles can only be removed from the end of a file without breaking that order.
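A short Python sketch may make the numbering issue concrete. The sample subtitle text and the block_numbers helper are invented for illustration; the layout (a number, a time range, the text, a blank line) is the usual .srt structure.

```python
import re

SRT = """1
00:00:01,000 --> 00:00:02,500
[DOOR CLOSES]

2
00:00:03,000 --> 00:00:05,000
For a smoke! [coughing]

3
00:00:06,000 --> 00:00:07,000
Not for a while.
"""

def block_numbers(text: str) -> list:
    """Return the integer that opens each blank-line-separated subtitle block."""
    blocks = [b for b in text.strip().split("\n\n") if b.strip()]
    return [int(b.splitlines()[0]) for b in blocks]

# Replacing the sound-like text with a space keeps all three blocks in place.
cleaned = re.sub(r"\[(.)*\]", " ", SRT)
print(block_numbers(cleaned))        # [1, 2, 3]

# Dropping the first block entirely leaves a sequence that starts at 2.
without_first = SRT.split("\n\n", 1)[1]
print(block_numbers(without_first))  # [2, 3]
```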
If you wish to continue with the regular expressions topic, then this article might be a good choice.