getting text from image-based subtitles

I’ll try to expand on this post later, but just a quick note: today I watched an excellent movie (The Abacus and the Sword) and wanted to take some of the subtitles to use as SRS cards. But the subs were in the image-based .sub format, so Aegisub couldn’t handle them. The solution is to use Subtitle Edit, which can run the subtitle images through OCR. It doesn’t come with a Japanese dictionary built in, so you need to get the Japanese language file from the Tesseract project and unpack it to the \Tesseract\tessdata folder under Subtitle Edit’s install folder. Then, when you open an image-based .sub file, Subtitle Edit asks whether you want to import it, which OCR system to use (Tesseract), and which language (Japanese). Then you wait, some magic happens, and you have text. It marks the lines it isn’t sure of in a different colour so you can correct them manually, but it gets most of them pretty much right.
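If Japanese doesn’t show up in the language list, the usual culprit is the language file sitting in the wrong folder. Here is a small Python sketch for checking that. It rests on two assumptions that are mine rather than anything from Subtitle Edit’s documentation: that the file is named jpn.traineddata (the usual Tesseract naming, but check what your download actually contains), and that Subtitle Edit lives at the path shown, which is just a placeholder. The function names are made up for this example.

```python
from pathlib import Path
from typing import Optional


def tessdata_dir(install_dir: str) -> Path:
    """Return the Tesseract\\tessdata folder under the given install folder."""
    return Path(install_dir) / "Tesseract" / "tessdata"


def find_language_data(install_dir: str, lang: str = "jpn") -> Optional[Path]:
    """Return the path to <lang>.traineddata if it is in tessdata, else None.

    Assumption: the language file follows the "<code>.traineddata" naming.
    """
    candidate = tessdata_dir(install_dir) / f"{lang}.traineddata"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    # Hypothetical install location; change it to wherever Subtitle Edit is.
    install = r"C:\Program Files\Subtitle Edit"
    found = find_language_data(install)
    if found is None:
        print("No Japanese language file in", tessdata_dir(install))
    else:
        print("Found", found)
```

This only checks that the file is where the OCR step would look for it. It says nothing about whether the file is the right version for the Tesseract build bundled with your copy of Subtitle Edit.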

Published on 2012/01/09 at 06:05
