Extracting Hardsubs

Introduction
For many aspiring re-releasers, one of the biggest obstacles is working with
older (mainly pre-2006) anime whose original fansubs were 100% hardsubbed.
Let's face it, any newer anime likely has softsubs easily available, and is going
to be covered by half a dozen Blu-Ray encoding/raw-remuxing groups. Older
hardsubbed series represent the greatest percentage of potential re-release
projects, in addition to being the series most in need of updated versions with
better audio/video quality. And if one isn't doing a full line-by-line retiming,
edit, and translation check, obtaining scripts from hardsubbed videos is the
most difficult and time-consuming part of the process.
Since most of the projects I work on involve series whose original fansubs were
100% hardsubbed, I thought I'd share a few tips and tricks on the optimal ways
of obtaining these scripts, and compare some advantages and disadvantages
of each.
These guides assume you can obtain and learn the basic functions of Aegisub.
Also, the Aegisub numerical values assume that you're putting these subs on a
480p DVD encode. Adjust them accordingly for higher-resolution sources.
Method Zero: Obtaining Scripts Directly
If the original fansub group is still active, contact them through standard
channels (website, forum, e-mail, IRC), explain who you are and what you're
doing, and ask nicely for the scripts for the show in question. If the group has
disbanded, check the staff credits in the video (most older fansubs will have
them), and track down the individuals via IRC, AnimeSuki, or MAL.
This can be the easiest method, but it also has a low probability of success.
While styled softsubs first became theoretically possible in 2002 or 2003
with .ogm + external .ssa releases (as seen with some R1 DVD-rips by Anime-
HQ), many groups were protective of their scripts and used hardsubbing as a
way to prevent people from... well, doing exactly what we're doing here. So
some fansubbers/groups may not be willing to share scripts. And even if they
are, the scripts may have been lost to FTP or HDD crashes, and thus no longer
exist anywhere outside of the hardsubbed videos.
Advantages:
* If original fansub scripts can be obtained, it effectively turns your hardsub ->
softsub project into a softsub -> softsub project, eliminating the need to obtain
scripts manually.
* Ensures that no errors will be introduced, unless you do further editing to the
scripts.
Disadvantages:
* Reduced chances of success, since it relies on others having the scripts and
being willing to share them.
* Since many groups used After Effects (AFX) for their typesetting and karaoke,
the scripts you receive might only have the dialogue, forcing you to redo signs
and songs yourself.
Method One: Optical Character Recognition (OCR):
OCR is a method that scans images, finds patterns of pixels of certain colors
grouped into shapes, and interprets those shapes as text based on user input.
For this method, you will need SubRip or an equivalent tool. The process of
using SubRip on hardsubbed videos is complicated enough to require its own
guide. I highly suggest reading this one, as it's how I learned to do it. This
section will mainly be supplemental tips for that guide.
The first step is to decide, "Are these subs OCR-able?" In general, subs that use
all one color, or have only a few similar colors, will be the most OCR-able.
Simpler, sans-serif fonts are also better, as serifs and other ornamentation will
cause problems for SubRip. Subtitles that use many different colors, like
different text or outline colors for each character, will not be OCR-able.
Examples of the latter include a.f.k.'s Full Moon, Static-Subs' My-Hime/My-
Otome, Ryoumi's Tonagura!, and Anime-Keep's Yumeria.
Some tips that supplement the linked SubRip guide:
1) Before starting in with SubRip, open one of the TV-fansubs in Aegisub. Use
Aegisub's Color Picker to determine the color hex values of the subtitles. Now
when you're in SubRip, you can manually enter those values if the automatic
color detection doesn't get the right colors. As the guide suggests, try to find a
2-line sub, and have SubRip scan a 2-line area. Get a sense of how wide the
subtitles extend in the image, and set the scanning rectangle wide enough so
that characters at the beginning and ends of sentences don't get cut off. For
many fansubs, this will be close to the entire width of the image. If the original
subs have few or no 2-line subs, you can try a 1-line rectangle to reduce the
scanning area, thus reducing the amount of non-text marks SubRip will detect.
Be prepared to use "Manual Entry" in SubRip if 2-liners do appear. Under "1-
line settings", if SubRip perfectly reads the bottom line of a 2-line sub without
any user input, it will miss the top line completely.
2) Use the color values to set SubRip's sensitivity settings. For instance, if you
find that your outline's values are Red 4, Green 78, and Blue 121, unclick the
"move all values together" option and change the outline sensitivity to reflect
the differences in those numbers, i.e. Red 34, Green 108, Blue 150. Don't
hesitate to pause the OCR scanning and adjust those numbers up or down with
"move all values" re-enabled if the detection isn't working right. Generally,
bright scenes where subtitles overlay images with colors similar to the subtitles
will require less sensitivity. Un-checking the "outline" button can also be
helpful. It's hard to go into here, but you have to get used to playing with the
settings to suit the scene, show, and subs you're OCRing. Your goal is to have
SubRip showing only the text and as few extraneous dots/marks as possible.
3) While running the OCR and inputting characters, make sure not to make
typos (duh). For random non-text marks, press Enter to "ignore" them. Resist
the temptation to press SpaceBar+Enter to mark them as blank spaces, even if
they appear in a space between two letters. Chances are, you'll later see words
broken by spaces if SubRip sees a similar mark between two letters within a
word. Automatic spell-checkers like Aegisub's have a much easier time
correcting missing spaces than additional ones. I recommend pausing SubRip
and skipping over OP/ED songs, since karaoke subs are often placed differently
from dialogue subs, use weird fonts, and have funky effects -- all of these will
confuse SubRip and slow you down. Just go back later with Aegisub to
manually retime and type those yourself.
4) While SubRip's automatic correction can fix some errors, you'll need to do
further correction. Open the resulting .srt in Aegisub, and load the hardsubbed
video. Move all the new subs so that they appear above the old ones, either
with Styles Manager or Select All + Margin Override. Run spell-check, and add
all common names and series-specific terms to the custom dictionary. Use
"Replace All" for OCR mistakes like I'II instead of I'll. Aegisub's Find+Replace
can also be useful in correcting OCR errors not noticed by spellcheck. Go
through all the subs line-by-line to check punctuation and consistency with the
original subs, as SubRip often omits or adds periods or other punctuation
marks. Of course, if the original subs had spelling errors or other typos, fix
those as well!
5) While SubRip's "sins of commission" should be fixed at this point, there are
still omissions to consider. Open the original hardsubbed video in a media
player, and have your script open in Aegisub. Fast-forward through the video,
taking note of any relevant onscreen text, relevant insert songs you want to
include, and missing dialogue. Obviously, SubRip is limited to that rectangle, so
it won't catch most onscreen text, dialogue at the top of the screen, or 3-line
subs, e.g. Person A's speech has 2 lines of text, Person B interrupts and adds a
3rd line. And depending on your settings and the original fansubs' timing,
SubRip may miss short lines like What? / But... / Huh? / I... / Sorry. / etc. Use
"Insert before/after..." in Aegisub to add the missing content in the proper
places, and retime them when you do your retiming/shifting for the new video
source. Once you've created karaoke files, copy+paste the song lyrics in, and
timeshift them if necessary.
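As a rough illustration of the "Replace All" clean-up in tip 4, here is a minimal Python sketch that fixes the I'II-for-I'll class of mistake in the text of an .srt. The pattern table and the function name are my own assumptions for this example, not anything built into SubRip or Aegisub, and you would extend the table with whatever confusions your particular font produces.

```python
import re

# Hypothetical fix-up table: a capital I misread in place of a lowercase l
# right after an apostrophe (I'II -> I'll, We'II -> We'll).
CONTRACTION_FIXES = [
    (re.compile(r"(?<=[A-Za-z])'II\b"), "'ll"),
]

def fix_ocr_contractions(text):
    """Return the corrected text and the number of replacements made."""
    total = 0
    for pattern, replacement in CONTRACTION_FIXES:
        text, count = pattern.subn(replacement, text)
        total += count
    return text, total

sample = "I'II be there.\nWe'II see about World War II."
fixed, n_fixed = fix_ocr_contractions(sample)
print(fixed)
print(n_fixed, "replacements")
```

Requiring the apostrophe keeps legitimate uses of "II" (as in "World War II") untouched. It still doesn't replace the line-by-line check against the hardsubs.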
Advantages:
* For subs with high "OCR-ability," SubRip can run fairly quickly and
automatically, once text and color settings are optimized and a character
matrix is established.
* Only a small percentage of the hardsubbed text actually needs to be entered
by you, thus allowing you to multitask. (I prefer to throw on an English-dubbed
anime rewatch on my adjacent TV.)
* OCR is the best way to automatically get timings for shows where no timed
scripts (in any language, see below) are available.
Disadvantages:
* Requires a LOT of work just to learn and set up.
* Does not work on all varieties of hardsubs.
* Inevitably introduces many errors and omissions, which must be manually
corrected.
* Does not detect "specialty" text outside a given area, so even if it works
perfectly, some manual retyping and retiming is still necessary.
* Depending on difficulty and PC speed, will likely take 2x an episode's run
time to scan a single episode. May take longer.
Method Two: Transcription
First off, some may be under the misconception that transcription involves
watching hardsubbed videos, pausing every few seconds to type things into a
text file, and then retiming everything from scratch. This is not necessary, as
you can do everything from within Aegisub, without the need to switch
between applications.
Transcription from timed, non-English subs:
Chances are, the show in question probably doesn't have English softsubs
available. (And if it does, just use those and edit/[de-]localize them to your
preferences.) However, there's a good chance that subbers over in Russia have
subbed the show and based their scripts on the most-respected English
hardsubs. So head over to Subs.RU, search for the show you want, and grab
the RU subtitle archive. Now, it's time to get things set up.
1) If there are any "Readme" or "Comments" text files, feed them into Google
Translate. If there are multiple English hardsubbed versions, the text files may
say which version the Russian subbers used. Next, open one of the .ass or .srt
files. Use Find+Replace to replace \N with a blank space. Then, copy roughly
half the lines, and paste them into Google Translate. Copy the resulting
translation, and use Paste Over (Shift+Ctrl+V) in Aegisub to paste the auto-
translated English text over the Russian. Repeat with the second half of the
script. (Trying to do the whole script at once will run over Google-TL's length
limits, and cut off numerous lines. Leaving the \N's in will also cause "\N[some RU word]" to
appear in the translated text.)
-- This can also work with scripts in other non-English languages, mainly
European ones whose subbers also translated from the English fansubs into
their native languages. Subs in other Asian languages like Chinese can work,
but I don't recommend them, as they'll be original JP->CN translations and their
timings likely won't mesh well with the English subs.
2) Once you have a script of semi-comprehensible English text, select all lines
and do the following:
* In the Shift Times dialog box (Ctrl+I), shift all lines forward 0.30 seconds.
This is to ensure that the subtitles in the script always appear slightly after the
subtitles in the hardsubbed video.
* Change the vertical margin to ~400, so that the subs appear extremely high
on the image, but not at the very top of the screen.
* With all lines selected, right-click and press "Duplicate." Select all the
duplicate lines, and change the vertical margin to ~90. Clear all text from the
duplicate lines by deleting the text from one line while all duplicate lines are
selected.
3) After all that, you should have a script with ~350 auto-translated, broken-
English lines appearing near the top of the screen, and an equal number of
blank lines with the same start/end times. Load the hardsubbed video into
Aegisub. Go to the blank lines, and begin typing in the text from the hardsubs,
making whatever changes and edits you deem necessary. Be wary of "right
spelling, wrong word" typos such as you're/your, out/our, not/now, is/if, and the
like, as these will be harder to spot later. Aegisub's spellcheck can handle most
outright errors, so if you see you've made an obvious mistake (red wavy
underlined word), just move on to the next line and fix it with spellcheck later.
4) Timing issues: In the ideal scenario, the RU or other non-English subs will
line up perfectly with the English subs, in terms of where lines (defined here as
cells on the subtitle grid, not the number of actual lines of text appearing
onscreen) begin and end -- some timers include s-stuttering, some don't -- and
how long lines are broken up. The reason I recommend auto-translating the RU
text is to create a "guide" to show approximately what content is covered
within the lines as the RU subbers timed them. This avoids accidental skipping,
combining, or other mis-entries of content while transcribing.
If the RU lines are *shorter* than the English lines, i.e. two or three short RU lines
cover the same dialogue as one long ENG line, just enter everything in the ENG
line into the first RU line. For the subsequent RU line or lines, enter a
placeholder like //. You can then later use Ctrl+F to find all of those, and use
"Join (keep first)" to sort them out.
If the RU lines are *longer* than the English lines, then you have more of an
annoyance to deal with. You will need to play the video to view the next English
line or lines, and then enter them into the RU line that spans their timecodes. I
recommend going into Aegisub's Hotkeys options and setting "Global Video
Play" to a single key like F9. By default, it's the more cumbersome Ctrl+P.
Of course, you can feel free to join or split lines later on, if you feel the original
English or RU times are too fragmented (lots of short lines with almost no time
on screen to read them) or too drawn-out (lengthy lines that stay up too long
and give away too much information too soon). I try to strike a middle ground
and keep lines between 2-5 seconds in length, where possible.
5) Styling: I usually use "Default" as something simple to make the RU subs
easily readable (often a "DVD-yellow" bold Arial), and set up another style like
"Main" for the actual English dialogue. Chances are, you'll want to use a few
more styles, such as an "Alternate" for overlapping dialogue, an Italic style for
thoughts, and/or a top-aligned style for background-type lines. If you're more
masochistic, you may also want to do different colors for different characters,
or to differentiate between onscreen/offscreen dialogue. You can use Aegisub's
features to combine the styling process with transcription. I do this by adding
special symbols to lines where I want a different style, like @ for
Alternate/Overlap, # for Thought, $ for Top, and so forth. Make sure they're
symbols that are unlikely to appear in the actual dialogue. Once the
transcription is complete, use Aegisub's Subtitles > Select Lines... option to
select all lines with a given symbol. Set the styles for it, then use Find+Replace
to change all those symbols to nothing. Repeat for all the symbols you've used.
For more complex styling, like color-coding by character, or the numerous
styles I used with Tonagura!, you'll likely have to go through line-by-line to set
the styles. Setting up hotkeys like F11 and F12 for "Global Prev Line" and
"Global Next Line" helps, as does Aegisub's Styling Assistant. To do
onscreen/offscreen styling correctly, you will need to view the beginning and
end of each line in Aegisub or a real-time watch, and shift between them if the
character moves on- or off-"camera" during the line. This can be done with \t
tags for gradual shifts, or by duplicating the line and adjusting start and end
times for instant shifts. Refer to my Tonagura! scripts for examples of this.
The elements that constitute *good* styling could easily fill a guide by
themselves, and fortunately you can read such a guide right here.
6) Final check: Whether you do a full retiming/editing/TLC on the transcribed
subs or not, it's still good to look over them and make sure you haven't
introduced any errors. This is partly why I say to set the "new" subs to a
vertical margin of 90 -- they'll appear above the originals, so you can scroll
through line by line and easily spot any unintentional missing/extra words,
punctuation errors, correctly-spelled wrong words, and so forth. This can be
combined with a line-by-line restyling, as described in (5). Finally, delete the
auto-translated lines, set the margin override back to 0, and reverse the time-
shift you did earlier. (This would also be a time to shift times for any TV-DVD
differences, like sponsor screens.) At this point, you'll now have a script that's
the same or better than the script used in the TV-rip. Depending on what the
RU subbers did, you may also need to manually include some things like signs,
songs, and previews.
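For those who would rather script the setup from steps 1 and 2 than click through it, here is a minimal Python sketch working on the "Dialogue:" lines of an .ass file. It assumes the standard ASS field order (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text) and H:MM:SS.cc timestamps. The function names and defaults are made up for this example, and it does the same thing as the Aegisub steps rather than replacing them.

```python
import re

TIME_RE = re.compile(r"(\d+):(\d\d):(\d\d)\.(\d\d)")

def shift_time(stamp, centiseconds):
    """Shift an ASS timestamp (H:MM:SS.cc) by a number of centiseconds."""
    h, m, s, cs = (int(x) for x in TIME_RE.fullmatch(stamp).groups())
    total = max(0, ((h * 60 + m) * 60 + s) * 100 + cs + centiseconds)
    s_total, cs = divmod(total, 100)
    m_total, s = divmod(s_total, 60)
    h, m = divmod(m_total, 60)
    return f"{h}:{m:02d}:{s:02d}.{cs:02d}"

def prepare_lines(dialogue_lines, shift_cs=30, guide_margin=400, blank_margin=90):
    """Return shifted guide lines followed by blank duplicates to type into."""
    guides, blanks = [], []
    for line in dialogue_lines:
        head, rest = line.split(": ", 1)
        # Split into 10 fields only, so commas inside the text survive.
        f = rest.split(",", 9)
        f[1], f[2] = shift_time(f[1], shift_cs), shift_time(f[2], shift_cs)
        f[9] = f[9].replace("\\N", " ")   # step 1: \N becomes a blank space
        f[7] = str(guide_margin)          # guide text sits high on the image
        guides.append(head + ": " + ",".join(f))
        f[7], f[9] = str(blank_margin), ""  # empty twin, same timing
        blanks.append(head + ": " + ",".join(f))
    return guides + blanks

russian = ["Dialogue: 0,0:01:02.85,0:01:05.10,Default,,0,0,0,,Привет,\\Nкак дела?"]
for out in prepare_lines(russian):
    print(out)
```

The blank duplicates are appended after all the guide lines here rather than interleaved the way Aegisub's "Duplicate" does it. Sorting by start time in Aegisub afterwards gives the same working layout.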
Projects done via this method: Tonagura!, Soul Link, Rizelmine (05 onwards),
Lime-Colored War Tales, Full Moon wo Sagashite, My Wife is a High School Girl,
G-On Riders, and probably any other future hardsub -> softsub conversion.
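The symbol trick from step 5 can also be sketched in a few lines of Python, again assuming standard ASS "Dialogue:" lines. The marker table simply mirrors my examples above (@, #, $), and the style names are assumed to already exist in the script. The function name is hypothetical.

```python
# Marker-to-style table mirroring the examples in step 5.
MARKER_STYLES = {"@": "Alternate", "#": "Thought", "$": "Top"}

def apply_style_markers(dialogue_lines):
    """Set each marked line's style, then strip the marker from its text."""
    styled = []
    for line in dialogue_lines:
        head, rest = line.split(": ", 1)
        fields = rest.split(",", 9)  # the 10th field is the subtitle text
        for marker, style_name in MARKER_STYLES.items():
            if marker in fields[9]:
                fields[3] = style_name
                fields[9] = fields[9].replace(marker, "").strip()
                break  # one style per line; the first matching marker wins
        styled.append(head + ": " + ",".join(fields))
    return styled

transcribed = [
    "Dialogue: 0,0:00:01.00,0:00:03.00,Main,,0,0,90,,#I wonder if she noticed...",
    "Dialogue: 0,0:00:03.50,0:00:05.00,Main,,0,0,90,,Good morning!",
]
for out in apply_style_markers(transcribed):
    print(out)
```

This does in one pass what Select Lines plus Find+Replace does per symbol. Unmarked lines keep whatever style they already had.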
*** Alternate method: Brute-Force Transcription ***
This method is a last resort, to be used when you cannot find timed scripts for
your project in any language. Most steps are the same, except that you'll have
to create a timed script yourself. Open the hardsubbed video, and then open
the audio from that video. Knowledge of some basic Japanese and familiarity with
the show in question will be useful. Go through the waveform and time
anything that resembles dialogue, without paying attention to the video -- use
"Audio+Subs View." It's best to err on the side of caution, and favor shorter line
times. It's easier to enter in a longer line and join subsequent lines into it, than
it is to start and stop the video while transcribing to enter several short lines
from the hardsubs into one longer-timed line. Of course, if you have a feel for
how the original fansub group did their timing, you can adjust your "pre-timing"
to fit. After all lines are pre-timed, shift them forward ~.30 seconds (or longer,
if you timed with significant amounts of lead-in), and transcribe away. Set the
vertical margins to ~90 if you wish to compare your transcriptions with the
original hardsubs for error-checking and styling. Signs will need to be added
manually by scanning the video; add karaoke from a separately-timed file.
Projects done via this method: No full series, thankfully, though I have done it for
a few random episodes of Saint October and Yes! Pretty Cure 5.
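Joining a long line with the short ones that follow it comes up both here and with the // placeholders in the timed-script method, so here is a small Python sketch of what I understand "Join (keep first)" to do: keep the first line's text and stretch its end time over the lines folded into it. The (start, end, text) tuples in centiseconds and the function name are assumptions made for this example, not Aegisub's internals.

```python
PLACEHOLDER = "//"

def join_keep_first(lines, placeholder=PLACEHOLDER):
    """Fold placeholder lines into the line before them.

    Each line is a (start, end, text) tuple with times in centiseconds.
    """
    joined = []
    for start, end, text in lines:
        if text.strip() == placeholder and joined:
            prev_start, prev_end, prev_text = joined[-1]
            # Keep the first line's text; extend its end over this line.
            joined[-1] = (prev_start, max(prev_end, end), prev_text)
        else:
            joined.append((start, end, text))
    return joined

timed = [
    (100, 250, "I told you already, I'm not going!"),
    (250, 400, "//"),
    (400, 520, "//"),
    (600, 800, "Fine."),
]
for line in join_keep_first(timed):
    print(line)
```

A placeholder on the very first line has nothing to join into, so the sketch leaves it in place for you to notice.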
Advantages:
* Can work on any hardsubs, regardless of their nature, or whether non-English
scripts are available or not.
* Once setup process is learned, is as easy as typing on a word processor.
* Fast with proper setup, only limited by one's typing speed.
* With proper care, can avoid introducing errors and even fix errors in original
subs.
* Aegisub features can make styling nearly automatic.
Disadvantages:
* Labor-intensive, can be tedious.
* Cannot multitask, aside from maybe listening to music.
* Can introduce typos and other errors that can be hard to detect.
* Not everything may be covered in non-English scripts -- some manual
addition and retiming may still be necessary.
* Timing in non-English scripts might not line up with English fansub timing,
thus adding extra work.
Comparisons and Conclusions:
Obviously, getting TV-rip scripts directly from original fansub staffers or
softsubbed .mkvs is the easiest route to go, doubly so if signs and/or karaoke
were softsubbed. Between OCR and transcription, I have to choose
transcription. While the semi-automatic nature of OCR is nice (and I burned
through many backlogged rewatches), the errors it introduces are aggravating.
Transcribing an episode takes more keystrokes, but no more time than an
average OCR job -- no more than 40-50 minutes per ep, while OCR can run
longer due to color issues and bad luck. Transcribing allows me to fix errors
from the original subs in the process, or even rewrite lines on the fly. Luckily,
I'm conscientious enough to avoid introducing new mistakes in transcription, or
at least aware enough to catch them during retiming/editing/QC. Transcription
also offers a greater (that is, non-zero) chance than OCR of catching signs and
dialogue text outside the normal subtitle area.
Source URL : http://redonesubs.blogspot.in/p/extracting-hardsubs.html?m=1
