Part 1: autogenerating rough English-language subtitles for a foreign language TV show

  • I talk about choosing a new Russian-language TV show to watch
  • I vent about looking for some English-language subtitles
  • I talk about using autosub to make some subtitles
  • I talk about the quality of those subtitles

Choosing a show to watch

Yesterday we were looking for a new television show to watch. We recently finished a show called Better Than Us, which was pretty good. It's the second Russian-language show we've watched in the past few months.

The first was one called Fartsa, which I was prepared to like because it touches on some interesting cultural and economic Soviet history, but which ended up being either intentionally or unintentionally vapid in its conclusions. It honestly was annoying to watch to the end. (Alexander Tsekalo, one of the producers of this show, basically said that his production company Sreda operates under the assumption that viewers are OK with bad and boring shows, as long as the shows have high production value. I mean, whatever, I'm not saying that's not technically correct and I'm probably reading too much into it. Maybe we'll watch another Sreda show at some point. But this one was lame.)

Anyway, Better Than Us had an interesting premise, good production value as far as I can tell, and a plot I enjoyed following, so the next show we picked was The Sweet Life, which has the same director, Andrey Junkovsky.

Looking for subtitles

Unfortunately, I couldn't find subtitles for this show in any language. I really looked for a while. I found a lot of subtitles for Fellini's La Dolce Vita. I looked on the sites I normally go to for English subtitles, then I looked for Russian-language subtitles at subs.com.ru and subtitry.ru. I downloaded a .ts file from a Chinese site that appeared to have some Russian closed-captioning linked but ultimately didn't. Someone posted on reddit a year ago looking for subtitles for this same show, and people have posted looking for good Russian subtitle sources on different livejournals. Finally I gave up when I read this 2014 post on masterrussian.net:

You'll be hard-pressed to find Russian subtitles for Russian TV, only really popular Russian movies have subtitles (which you can easily find at subtitleseeker or subscene) or Hollywood films that are dubbed in Russian also have subtitles (although not always sync between the text and the dubbing). I doubt you'll find Russian subtitles for a Russian TV series (like интерны), I tried really hard and came up with nothing. Russia just doesn't bother with subtitles.

So anyway, I don't know how accurate that is, but that's when I stopped looking, since it was late and we wanted to watch the show.

Making some subtitles

So I tried out BingLingGroup's Python autosub package again, the same one that I first tried out in this post.

The movie file was in .mkv format, which is a multimedia container format. Cute sidenote: I just learned that the name "mkv" kind of comes from "matryoshka," those Russian dolls that stack inside each other.

The first try was basically a blind try:
$ autosub -i mov.mkv ru
autosub: error: unrecognized arguments: ru


As you can see, this didn't work. So I asked for some help, and I've copy-pasted the helpful information here:

$ autosub --help
...
Language Options:
  Options to control language.
  -S lang_code, --speech-language lang_code
                        Lang code/Lang tag for speech-to-text. Recommend using
                        the Google Cloud Speech reference lang codes. WRONG
                        INPUT WON'T STOP RUNNING. But use it at your own risk.
                        Ref: https://cloud.google.com/speech-to-
                        text/docs/languages(arg_num = 1) (default: None)
  -SRC lang_code, --src-language lang_code
                        Lang code/Lang tag for translation source language. If
                        not given, use langcodes to get a best matching of the
                        "-S"/"--speech-language". If using py-googletrans as
                        the method to translate, WRONG INPUT STOP RUNNING.
                        (arg_num = 1) (default: None)
  -D lang_code, --dst-language lang_code
                        Lang code/Lang tag for translation destination
                        language. Same attention in the "-SRC"/"--src-
                        language". (arg_num = 1) (default: None) 
...
List Options:
  List all available arguments.
...
  -lsc [lang_code], --list-speech-codes [lang_code]
                        List all recommended "-S"/"--speech-language" Google
                        Speech-to-Text language codes. If no arg is given,
                        list all. Or else will list a group of "good match" of
                        the arg. Default "good match" standard is whose match
                        score above 90 (score between 0 and 100). Ref:
                        https://tools.ietf.org/html/bcp47 https://github.com/L
                        uminosoInsight/langcodes/blob/master/langcodes/__init_
                        _.py lang code example: language-script-region-
                        variant-extension-privateuse (arg_num = 0 or 1)
  -ltc [lang_code], --list-translation-codes [lang_code]
                        List all available "-SRC"/"--src-language" py-
                        googletrans translation language codes. Or else will
                        list a group of "good match" of the arg. Same docs
                        above. (arg_num = 0 or 1)
...
$ autosub -lsc
...
en-us             English (United States)
...
ru-ru             Russian (Russia)
...
What I took from this is that it might work if I specify the language being spoken with -S and specify a language I'd like the subtitles in with -D. This is kind of cool. I had been expecting to end up with Russian-language subtitles, but this option includes subtitle translation.

So this is the command that ultimately got us some subtitles:
$ autosub -i mov.mkv -S ru-ru -D en-us
It took about fourteen minutes to finish, going by the progress timers in the output below. The episode was almost fifty minutes long.
...
Converting speech regions to short-term fragments.
Converting: 100% |#########################################################################################################| Time:  0:00:51

Sending short-term fragments to Google Speech V2 API and getting result.
Speech-to-Text: 100% |#####################################################################################################| Time:  0:12:51

Translating text from "ru" to "en".
Translation: 100% |########################################################################################################| Time:  0:00:32 
Destination language subtitles file created at "mov.en.srt".

All works done.
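
Since I'll probably want to do this for more than one episode, here's a rough sketch of how the same command could be built for every .mkv file in a folder. It only uses the -i, -S and -D flags from the command above. The function names, the folder layout and the dry_run switch are my own assumptions for illustration, not anything that comes with autosub.

```python
from pathlib import Path
import subprocess


def build_autosub_command(video, speech_lang="ru-ru", dst_lang="en-us"):
    """Return the argument list for one autosub run.

    Only the -i, -S and -D flags shown earlier in this post are used.
    """
    return ["autosub", "-i", str(video), "-S", speech_lang, "-D", dst_lang]


def subtitle_episodes(folder, pattern="*.mkv", dry_run=True):
    """Build (and optionally run) one autosub command per matching video.

    With dry_run=True the commands are only printed, so nothing is sent
    anywhere until you have checked them.
    """
    videos = sorted(Path(folder).glob(pattern))
    commands = [build_autosub_command(video) for video in videos]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # Assumes autosub is installed and on the PATH.
            subprocess.run(command, check=True)
    return commands
```

Calling subtitle_episodes("episodes") would print one command per file, and passing dry_run=False would actually run them one after another.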

Results

The results were helpful.

We got some English-language subtitles with varied accuracy for probably a third of the speech in the show, and lyrics to part of a song that a character was listening to. It didn't transcribe the lyrics to any songs playing in the background when characters were speaking.
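
My "probably a third" is a guess from watching. A slightly less hand-wavy check would be to read the .srt file and add up how much time is actually captioned, and count the caption groups that have no text. Here's a minimal sketch, assuming the usual .srt layout (an index line, a timing line like 00:00:01,000 --> 00:00:03,500, then text, with blank lines between groups). The function name and the returned keys are made up for this example.

```python
import re

# Matches an .srt timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(hours, minutes, seconds, millis):
    """Convert the four captured timestamp fields to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def srt_stats(srt_text):
    """Count caption groups, blank groups, and seconds of captioned time."""
    total = 0
    blank = 0
    captioned = 0.0
    # Caption groups are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        timing_at = next(
            (i for i, line in enumerate(lines) if TIMING.search(line)), None
        )
        if timing_at is None:
            continue  # no timing line, so not a caption group
        total += 1
        text = " ".join(lines[timing_at + 1:]).strip()
        if not text:
            blank += 1  # a group with timing but no words
            continue
        fields = TIMING.search(lines[timing_at]).groups()
        captioned += _seconds(*fields[4:]) - _seconds(*fields[:4])
    return {"groups": total, "blank": blank, "captioned_seconds": captioned}
```

Dividing captioned_seconds by the length of the episode gives the share of the runtime that has captions. That isn't the same thing as the share of the speech that got captioned, but it's a start.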

One thing I found interesting is that I don't think the caption groups reflected a change in speaker unless there were significant pauses between their speech. For example, here's one character Alexandra (on the right) asking Igor (on the left) what his favorite [ballet] is, and he responds "Black Swan."



As you can see, the caption groups are not split to indicate that there are two speakers. As far as I know, the standard formatting to indicate multiple speakers in one caption group is by using hyphens and putting them on separate lines, like so:



I don't know if the Google Speech API looks for changes in speakers or not.
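
The hyphen convention I described above is easy to sketch in code, at least for the case where you already know where one speaker stops and the other starts (which is exactly the part autosub didn't give me). The function name and the English wording of the two lines are just illustrative.

```python
def format_two_speakers(first_line, second_line):
    """Join two speakers' lines into one caption, one hyphenated line each."""
    return "\n".join(f"- {line.strip()}" for line in (first_line, second_line))


# Illustrative wording for the exchange between Alexandra and Igor.
print(format_two_speakers("What's your favorite ballet?", "Black Swan."))
```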

Another thing I found interesting is that a curse word was asterisked out:

Two people talking at a table with a subtitle that says "What the **** s start"

I wonder where words get flagged as curse words, and where they're getting asterisked?
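
This doesn't answer where the masking happens, but a small script can at least list every line in the .srt file that contains a run of asterisks, which would make it easier to compare against a Russian-only transcript later. This is a sketch under my own assumptions: the function name is made up, and treating two or more asterisks in a row as a masked word is a guess based on the one example I saw.

```python
import re

# Two or more asterisks in a row are treated as a masked word.
MASKED = re.compile(r"\*{2,}")


def find_masked_lines(srt_text):
    """Return (line_number, line) pairs for lines containing a masked word."""
    return [
        (number, line)
        for number, line in enumerate(srt_text.splitlines(), start=1)
        if MASKED.search(line)
    ]
```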

Conclusion

Anyway, that's all I have to say about this first try of making English-language subtitles for a Russian-language show. I think I'll make some more today for when we watch the next episode, and I'll maybe see how Russian subtitles come out instead of translating straight to English.

I posted the .srt file to the GitHub repo in case it's interesting or useful to anyone.

  • Does the earlier autosub package (from agermanidis' GitHub) have the same translation options as BingLingGroup's autosub?
  • Does Google's Speech v2 API look for speaker changes?
  • Was there any Russian that wasn't translated to English?
  • Why are there blank caption groups in the .srt file? Is that a usual thing in .srt files?
  • Was there anything I could have tweaked to improve the quality of these subtitles?
