CaptionSync's Closed Captioning supports a variety of characters. This article covers the technical details of character encoding.
Introduction:
Most people are aware that the characters that appear in your text documents are actually represented by the computer as numeric values. The computer uses these numeric values to store and manipulate the characters and then converts them to a displayable image when presenting them on the screen or printing them on paper. What many people do not know, however, is that there are several different mappings between characters and numeric values (called the "character encoding"). That can make CaptionSync's job a challenging one.
The three most common character encodings are "ISO-8859-1" (used by Windows, where it is loosely referred to as "ANSI"; strictly speaking, the Windows "ANSI" code page is Windows-1252, a close superset), "Mac-Roman" (used by older Mac software), and UTF-8 (a broader standard available on both Windows and Mac). All three encodings represent the basic characters used in English with the same numeric values (these characters are the 7-bit US-ASCII character set). The problems start with the so-called "high-ASCII" characters (those whose values need 8 or more bits), which include accented characters and curly quotes. These problems are compounded by the fact that the line 21 caption data used in broadcast captions has its own special encoding, which differs from all three of the above.
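To make the difference concrete, the short Python sketch below encodes the same characters under each of the three encodings. It is an illustration only, not part of CaptionSync; the codec names "latin-1" and "mac_roman" are Python's names for ISO-8859-1 and Mac-Roman, and the helper name byte_values is made up for this example.

```python
# Illustration only: compare the byte values of a character in each encoding.
# "latin-1" is Python's codec name for ISO-8859-1; "mac_roman" is Mac-Roman.
ENCODINGS = ["latin-1", "mac_roman", "utf-8"]


def byte_values(ch):
    """Return a dict mapping each encoding name to the bytes used for ch."""
    return {enc: ch.encode(enc) for enc in ENCODINGS}


if __name__ == "__main__":
    for ch in ["A", "é"]:
        # "A" is plain US-ASCII, so all three encodings agree;
        # "é" is a high-ASCII character, so each encoding differs.
        print(repr(ch), {enc: b.hex() for enc, b in byte_values(ch).items()})
```

The plain letter comes out identical in all three encodings, while the accented letter gets a different byte (or pair of bytes, in UTF-8) in each one.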
CaptionSync accepts transcript files in any of the three common encodings:
► ISO-8859-1
► Mac-Roman
► UTF-8
UTF-16 files are also accepted; they are automatically converted to UTF-8 when you submit them.
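If you would rather convert a UTF-16 file yourself before submitting it, the conversion might look like the following Python sketch. This is an assumption about one reasonable way to do it, not a description of how CaptionSync performs the conversion internally; the function name utf16_to_utf8 is hypothetical.

```python
import codecs


def utf16_to_utf8(data):
    """Re-encode UTF-16 bytes as UTF-8 bytes.

    Python's "utf-16" codec reads the byte-order mark (BOM) at the start of
    the data to decide the byte order, and drops the BOM from the result.
    """
    text = data.decode("utf-16")
    return text.encode("utf-8")


if __name__ == "__main__":
    # Build a small little-endian UTF-16 sample with a BOM, then convert it.
    sample = codecs.BOM_UTF16_LE + "café".encode("utf-16-le")
    print(utf16_to_utf8(sample))
```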
Text encoded as US-ASCII is still acceptable as it can be considered a subset of any of the above three encodings (and will be noted as UTF-8). The encoding of your text file is noted near the bottom of the details page for every submission.
CaptionSync will usually be able to determine automatically which encoding your text uses, but in some cases the encoding is ambiguous. If this happens, CaptionSync will ask you for clarification.
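The sketch below shows why detection can be ambiguous. It is a simplified illustration in Python, not CaptionSync's actual detection logic, and the function name guess_encoding and its return values are assumptions made for this example. Pure US-ASCII and valid UTF-8 can be recognized reliably, but ISO-8859-1 and Mac-Roman both accept almost any byte sequence, so the bytes alone cannot tell them apart.

```python
def guess_encoding(data):
    """Return "UTF-8" when the bytes are unambiguous, else "ambiguous".

    Simplified illustration: real detection may use further heuristics.
    """
    if data.isascii():
        # US-ASCII is a subset of all three encodings; report it as UTF-8.
        return "UTF-8"
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: could be ISO-8859-1 or Mac-Roman.
        # The bytes alone cannot settle this, so the user has to be asked.
        return "ambiguous"


if __name__ == "__main__":
    print(guess_encoding(b"plain text"))
    print(guess_encoding("café".encode("utf-8")))
    print(guess_encoding("café".encode("latin-1")))
```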
If you have any broadcast outputs selected (those marked with an asterisk in the output selection menu), CaptionSync will scan your transcript and warn you if you have any characters that are not supported by the line 21 encoding. You will be given the opportunity to alter your text, or override the warning; if you override the warning, any characters not supported in the line 21 encoding will be replaced by a space. See table 1 for a list of the characters supported for line 21 outputs.
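If you want to check a transcript yourself before submitting it for broadcast outputs, a check along these lines is possible. The Python sketch below builds its allowed set from the characters listed in Table 1; treating the space and line-ending characters as allowed is an assumption, since Table 1 does not list them, and the function names are hypothetical. It is not CaptionSync's own scanner.

```python
# Characters copied from Table 1 (line 21 broadcast character set).
LINE21_ALLOWED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "áàâãäåçéèêëíîìïÑñóôòõöúûùü"
    "ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÒÓÔÕÖÙÚÛÜ"
    "!,.;:'\"#%&@/()[]+-<=>?$¿¡*"
    " \r\n"  # assumption: whitespace is allowed although Table 1 omits it
)


def unsupported_chars(text):
    """Return a sorted list of the distinct characters not in Table 1."""
    return sorted({ch for ch in text if ch not in LINE21_ALLOWED})


def replace_unsupported(text):
    """Mimic overriding the warning: unsupported characters become spaces."""
    return "".join(ch if ch in LINE21_ALLOWED else " " for ch in text)


if __name__ == "__main__":
    transcript = "“Hola” señor"
    print(unsupported_chars(transcript))    # the two curly quotes
    print(replace_unsupported(transcript))
```

In this sample the curly quotes are flagged, while the ñ passes because it appears in Table 1.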
If you have only web output types selected, CaptionSync will allow you to submit any valid characters in your file. The characters in your transcript will be mapped to appropriate encoding values for the output formats you selected. There are a couple of important points to consider:
► For a variety of technical reasons, not all characters in all encodings are supported. At this time, CaptionSync supports captioning in English, Spanish, French, and German; all characters needed for these languages are supported, along with a wide variety of special characters. If your transcript includes characters that are not currently supported, they will be replaced by a space.
► Each caption output format has a particular encoding specified for that format. Regardless of what encoding you submit your transcript in, the output file will be generated in the encoding specified by the standard for that format. It is possible that characters in your input transcript file may have no equivalent representation in the encoding specified for the output format you choose; when this happens, the offending character is replaced by a space. See table 2 for a list of the encoding specified for each output format.
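The second point can be illustrated with a short Python sketch. It encodes text for a given output encoding and substitutes a space for any character that encoding cannot represent. This is only a model of the behavior described above, not CaptionSync's code; the function name encode_for_output is hypothetical, and writing the space as a single ASCII byte assumes an ASCII-compatible output encoding such as ISO-8859-1 or Mac-Roman.

```python
def encode_for_output(text, encoding):
    """Encode text, replacing characters the encoding lacks with a space."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # No equivalent in the output encoding: substitute a space.
            # Assumes an ASCII-compatible encoding, where space is 0x20.
            out += b" "
    return bytes(out)


if __name__ == "__main__":
    # Curly quotes do not exist in ISO-8859-1, so they become spaces,
    # while the accented letter is kept.
    print(encode_for_output("“café”", "latin-1"))
    # Mac-Roman does include curly quotes, so nothing is lost there.
    print(encode_for_output("“café”", "mac_roman").decode("mac_roman"))
```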
Changing the Encoding of a File:
Most modern text editors are aware of multiple character encodings. In Windows, both Notepad and Word can convert between ISO-8859-1 and UTF-8 (WordPad is not a good choice for this purpose). On the Mac, TextEdit can convert between ISO-8859-1, Mac-Roman, and UTF-8.
To determine which character encoding a file uses, open it with a given encoding and check whether it displays correctly. Text that contains high-ASCII characters in ISO-8859-1 or Mac-Roman is generally not valid UTF-8, so if you try to open such a file as UTF-8 your text editor will usually report an error. Unfortunately, the reverse is not true: UTF-8 and Mac-Roman files are technically valid ISO-8859-1, but strange characters will appear because the characters are mis-mapped. Similarly, UTF-8 and ISO-8859-1 files are technically valid Mac-Roman, but strange characters will appear for the same reason.
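The following Python sketch reproduces both situations with a single accented letter. It is an illustration only, and the helper name try_decode is made up for this example.

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    latin1_bytes = "é".encode("latin-1")
    utf8_bytes = "é".encode("utf-8")

    # ISO-8859-1 bytes opened as UTF-8: invalid, so an error (None here).
    print(try_decode(latin1_bytes, "utf-8"))

    # UTF-8 bytes opened as ISO-8859-1: valid, but mis-mapped.
    print(try_decode(utf8_bytes, "latin-1"))

    # UTF-8 bytes opened as Mac-Roman: also valid, also mis-mapped.
    print(try_decode(utf8_bytes, "mac_roman"))
```

The first case fails outright, which is easy to spot. The other two succeed silently and show two wrong characters in place of the accented letter, which is why the text must be checked by eye.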
To change the encoding of a file, load the file in a text editor and use the "Save As" function. Note, however, that the file must first appear with the correct characters onscreen; otherwise, the strange characters themselves will be converted to the new encoding.
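For those who prefer a script to a text editor, the same conversion can be sketched in Python as shown below. The function name convert_file and the file names are hypothetical. As with the editor approach, the source encoding you pass in must be the file's real encoding, or the mis-mapped characters are what get saved.

```python
def convert_file(src_path, dst_path, src_encoding, dst_encoding):
    """Read src_path in src_encoding and write it to dst_path in dst_encoding.

    newline="" keeps the original line endings untouched in both directions.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    with open("transcript.txt", "wb") as f:
        f.write("café\r\n".encode("latin-1"))
    convert_file("transcript.txt", "transcript.utf8.txt", "latin-1", "utf-8")
```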
Also note that when you load a text file in Notepad or Word, strange characters may still appear at the line endings, even with the correct encoding. This is because there are three different ways to mark a line ending (independent of the encoding): Line Feed (LF) on the Mac and Unix, Carriage Return (CR) in older Mac software, and Carriage Return plus Line Feed (CR/LF) on Windows. The CR character without an LF may appear as ^M. The LF character without a CR may appear as •. Both of these can safely be ignored. TextEdit should automatically handle all the line ending cases.
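If the stray line-ending characters bother you, they can be normalized to a single convention. The Python sketch below is one simple way to do that; the function name normalize_line_endings is hypothetical, and nothing here is required for CaptionSync to accept your file.

```python
def normalize_line_endings(text, newline="\n"):
    """Convert CR/LF, lone CR, and lone LF line endings to one convention."""
    # Collapse CR/LF first so it is not counted as two separate endings,
    # then convert any remaining lone CR, then apply the requested ending.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return unified.replace("\n", newline)


if __name__ == "__main__":
    mixed = "Windows line\r\nold Mac line\rUnix line\n"
    print(repr(normalize_line_endings(mixed)))          # all LF
    print(repr(normalize_line_endings(mixed, "\r\n")))  # all CR/LF
```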
The output caption files are designed to come back with the appropriate character encoding at all times, so the encoding should be largely transparent to you. However, if you need an output caption file in a different character encoding than you receive, you can again use a text editor to convert the encoding as you need.
Character Encoding types:
Transcript files are returned in the encoding in which we receive them. If you provide your own transcript and request either a .txt or .clean.txt file as one of your result types, these files will be returned in the same encoding as your original transcript. If AST provides the transcript, the files will be in the encoding used by the transcriber: either ISO-8859-1 or UTF-8.
If high-ASCII characters are not displayed properly, make sure you open your file with the correct encoding. You can do this in two ways:
- Download your transcript from the Submission Details page, in the Transcript Text Encoding field. There, choose the UTF-8 text transcript link, and your file will be provided in the UTF-8 encoding. The high-ASCII characters should then display properly in your text editor. Notepad (on Windows) and TextEdit (on the Mac) are good choices for checking that these characters are displayed.
- If you open the transcript you received from us in a text editor, do so through the File -> Open sequence, preferably in Notepad (on Windows) or TextEdit (on the Mac). Choose the correct encoding (shown on the Submission Details page, in the Transcript Text Encoding field). Then do a Save As using the UTF-8 encoding. The high-ASCII characters should now display properly in your text editor.
Table 1: Allowable characters for line 21 broadcast:
Character Type | Valid Characters |
upper-case alphabet | A B C D E F G H I J K L M N O P Q R S T U V W X Y Z |
lower-case alphabet | a b c d e f g h i j k l m n o p q r s t u v w x y z |
numerals | 0 1 2 3 4 5 6 7 8 9 |
accented letters | á à â ã ä å ç é è ê ë í î ì ï Ñ ñ ó ô ò õ ö ú û ù ü À Á Â Ã Ä Å Ç È É Ê Ë Ì Í Î Ï Ò Ó Ô Õ Ö Ù Ú Û Ü |
punctuation and signs | ! , . ; : ' " # % & @ / ( ) [ ] + - < = > ? $ ¿ ¡ * |
Note that the .cap and .asc files do not represent the full broadcast character set (some accented letters are not supported in these formats).
All caption files produced by CaptionSync use the UTF-8 encoding, with the exception of those listed in Table 2 below.
NOTE: Transcript outputs, like .clean.txt, .txt, .prod.txt and .tagged.txt, are created employing the same text encoding the original transcript file uses. For example, if the transcriber saved the original transcript file in ANSI/ISO-8859-1, then the transcript outputs are also encoded in ANSI/ISO-8859-1.
Table 2: Character encoding for the caption output formats:
Output type | Extension(s) | Encoding |
Web Outputs | | |
SAMI | .smi, .ms.smi | ISO-8859-1 |
WMP | .wmp.txt | ISO-8859-1 |
QuickTime | .qt.txt, .kar.txt, .qtwrd.txt, .qtwrdrvl.txt | ISO-8859-1 / Mac-Roman (depends on input encoding; UTF-8 is mapped to ISO-8859-1) |
Real | .rt | ISO-8859-1 / Mac-Roman (depends on input encoding; UTF-8 is mapped to ISO-8859-1) |
LRC | .lrc | ISO-8859-1 |
Tegrity Lecture | .tegrity.txt | ISO-8859-1 |
DVD Outputs | | |
SCC | .scc, .ndf.scc, .pc.scc, .pc.ndf.scc | EIA-608 |
VideoTape Outputs | | |
Cheetah ASCII | .asc | EIA-608 |
Cheetah Binary | .cap | EIA-608 |
RapidText | .xms | ISO-8859-1 (not specified by manufacturer) |