When you closed caption your files with CaptionSync, you can submit your own formatted transcript. This article explains how to select the appropriate character encoding for your transcript file.
Character Encoding Formatting:
If you wish to supply us with your own transcript, you need to ensure you choose the appropriate character encoding:
- For English transcripts, we recommend that you use US-ASCII as the encoding.
- For other languages (Spanish, French or German), choose UTF-8 as the encoding.
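As a rough illustration of the guideline above, the following Python sketch suggests an encoding from the characters a transcript actually contains. The helper name `recommended_encoding` is hypothetical and is not part of CaptionSync. It also assumes that "English" means "contains only US-ASCII characters", which is a simplification.

```python
def recommended_encoding(text: str) -> str:
    """Suggest an encoding for a transcript (hypothetical helper).

    Assumption: text made only of US-ASCII characters can be saved as
    US-ASCII; anything else should be saved as UTF-8.
    """
    return "US-ASCII" if text.isascii() else "UTF-8"


print(recommended_encoding("Hello, world."))               # US-ASCII
print(recommended_encoding("¿Dónde está la biblioteca?"))  # UTF-8
```

An English transcript containing curly quotes or accented names would be reported as UTF-8 by this check.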
How to save your transcript as a UTF-8 .txt file -- Windows users:
In Notepad or Word, choose File → Save As, select the .txt type, and choose UTF-8 as the encoding.
How to save your transcript as a UTF-8 .txt file -- Mac users:
In TextEdit, choose File → Save As (you may need to hold down the "Option" key for that to appear), select the .txt type, and choose UTF-8 as the encoding.
You can also set TextEdit's default encoding to UTF-8. Keep in mind, however, that TextEdit tries to determine the encoding of any file it opens automatically (for example, it can read ISO-8859-1) and will not change that encoding when it updates the file. To change the encoding of an existing file, you need to do an explicit Save As.
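If you prefer to convert a file with a script instead of an editor, the same "explicit Save As" step can be sketched in Python. This is a minimal example, not a CaptionSync tool. The function name `resave_as_utf8` and the file names are hypothetical, and it assumes you already know the encoding the file was originally saved in.

```python
import tempfile
from pathlib import Path


def resave_as_utf8(source: Path, target: Path, source_encoding: str) -> None:
    """Read `source` using `source_encoding` and write the text to `target` as UTF-8."""
    text = source.read_text(encoding=source_encoding)
    target.write_text(text, encoding="utf-8")


# Demonstration with a temporary ISO-8859-1 file (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "transcript_latin1.txt"
    converted = Path(tmp) / "transcript_utf8.txt"
    original.write_bytes("caf\u00e9".encode("iso-8859-1"))  # the bytes for "café"

    resave_as_utf8(original, converted, "iso-8859-1")
    print(converted.read_bytes())  # b'caf\xc3\xa9'
```

Passing the wrong `source_encoding` will not raise an error for ISO-8859-1 or Mac-Roman input; it will silently produce the wrong characters, so check the result before submitting.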
CaptionSync supports three different character encodings:
- UTF-8
- ISO-8859-1 (This is the Windows encoding)
- Mac-Roman (This is the Mac encoding)
US-ASCII is a subset of all three of these encodings, which makes it a good choice for content with only English characters: it provides the widest compatibility with the character restrictions of the various output formats.
But even in English there are a number of characters, such as curly quotes “ ” ‘ ’, which are not part of the US-ASCII character set. This matters even more if you are submitting transcripts in languages other than English, as characters such as ¿ é à ñ are needed. This is why we recommend saving your transcript file as UTF-8, especially if you are editing text documents across multiple platforms (e.g. Windows and Mac).
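To see which characters in a transcript fall outside US-ASCII, a short Python check along these lines may help. The helper `non_ascii_characters` and the sample sentence are illustrative assumptions, not something CaptionSync provides.

```python
import unicodedata


def non_ascii_characters(text: str) -> dict:
    """Map each non-ASCII character found in `text` to its Unicode name."""
    return {
        ch: unicodedata.name(ch, "UNKNOWN")
        for ch in text
        if ord(ch) > 127  # US-ASCII covers code points 0 to 127 only
    }


sample = "\u201cDon\u2019t worry,\u201d she said. \u00bfC\u00f3mo est\u00e1s?"
for ch, name in non_ascii_characters(sample).items():
    print(repr(ch), name)
```

An empty result means the text can be saved as US-ASCII without losing anything.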
Support for these encodings is transparent: you can just submit your transcript and CaptionSync will figure out what encoding you are using. If CaptionSync cannot determine the encoding with a high degree of certainty, it will ask you. Because the same byte value can represent different valid characters in different encodings, ISO-8859-1 can be mistakenly identified as Mac-Roman (and vice versa), which results in strange characters (é showing up as È, for example). UTF-8 is the most reliable encoding because it leaves nothing to guesswork; the same cannot be said for either ISO-8859-1 or Mac-Roman.
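The ambiguity described above can be reproduced with Python's built-in codecs. This is only an illustration of why a single byte can be read two ways; it does not show how CaptionSync itself detects encodings, and the helper `is_valid_utf8` is a hypothetical name.

```python
# "é" saved as ISO-8859-1 is the single byte 0xE9.
data = "\u00e9".encode("iso-8859-1")

print(data.decode("iso-8859-1"))  # é
print(data.decode("mac_roman"))   # È  (same byte, different character)


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The ISO-8859-1 byte is not valid UTF-8, so it cannot be mistaken for it.
print(is_valid_utf8(data))                       # False
print(is_valid_utf8("\u00e9".encode("utf-8")))   # True
```

Both single-byte decodings succeed without error, which is why a detector has to guess between them, while invalid UTF-8 is rejected outright.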
CaptionSync produces over 60 different output formats and it is important to know that not all of the output formats are capable of representing all characters. Because this gets to be a rather involved topic, we have created a Character Encoding Technical Reference to help explain its complexity.