perlunicode - Unicode support in Perl

# DESCRIPTION

If you haven't already, before reading this document, you should become familiar with both perlunitut and perluniintro.

Unicode aims to UNI-fy the en-CODE-ings of all the world's character sets into a single Standard. For quite a few of the various coding standards that existed when Unicode was first created, converting from each to Unicode essentially meant adding a constant to each code point in the original standard, and converting back meant just subtracting that same constant. For ASCII and ISO-8859-1, the constant is 0. This made it easy to do the conversions, and facilitated the adoption of Unicode.

And it worked; nowadays, those legacy standards are rarely used.

Unicode specifies many things outside the scope of Perl, such as how to display sequences of characters. For a full discussion of all aspects of Unicode, see the Unicode Standard itself.

# Important Caveats

Even though some of this section may not be understandable to you on first reading, we think it's important enough to highlight some of the gotchas before delving further, so here goes:

Unicode support is an extensive requirement. While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features.

Also, the use of Unicode may present security issues that aren't obvious; see "Security Implications of Unicode" below.

# Safest if you use feature 'unicode_strings'

In order to preserve backward compatibility, Perl does not turn on full internal Unicode support unless the pragma use feature 'unicode_strings' is specified. (This is automatically selected if you use v5.12 or higher.) Failure to do this can trigger unexpected surprises.

The pragma does not change the internal representation of strings, only their interpretation. There are still several places where Unicode isn't fully supported, such as in filenames.

Use the :encoding(...) layer to read from and write to filehandles using the specified encoding. (See open.)

# You must convert your non-ASCII, non-UTF-8 Perl scripts to be UTF-8.

The encoding module has been deprecated since perl 5.18 and the perl internals it requires have been removed with perl 5.26.

# use utf8 still needed to enable UTF-8 in scripts

If your Perl script is itself encoded in UTF-8, the use utf8 pragma must be explicitly included to enable recognition of that (in string or regular expression literals, or in identifier names). This is the only time when an explicit use utf8 is needed.

If a Perl script begins with the bytes that form the UTF-8 encoding of the Unicode BYTE ORDER MARK (BOM, see "Unicode Encodings"), those bytes are completely ignored.

If a Perl script begins with the Unicode BOM (UTF-16LE, UTF-16BE), or if the script looks like non-BOM-marked UTF-16 of either endianness, Perl will correctly read in the script as the appropriate Unicode encoding.

# Byte and Character Semantics

Before Unicode, most encodings used 8 bits (a single byte) to encode each character. Thus a character was a byte, and a byte was a character, and there could be only 256 or fewer possible characters. "Byte Semantics" in the title of this section refers to this behavior. There was no need to distinguish between "Byte" and "Character".

Then along comes Unicode, which has room for over a million characters (and Perl allows for even more). This means that a character may require more than a single byte to represent it, and so the two terms are no longer equivalent. What matter are the characters as whole entities, and not usually the bytes that comprise them. That's what the term "Character Semantics" in the title of this section refers to.

Perl had to change internally to decouple "bytes" from "characters". It is important that you too change your ideas, if you haven't already, so that "byte" and "character" no longer mean the same thing in your mind.

The basic building block of Perl strings has always been a "character". The changes basically come down to this: the implementation no longer assumes that a character is always just a single byte.

String handling functions, for the most part, continue to operate in terms of characters. length(), for example, returns the number of characters in a string, just as before. But that number is no longer necessarily the same as the number of bytes in the string (there may be more bytes than characters).
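
The difference between character counts and byte counts described under "Byte and Character Semantics" can be seen in a short sketch. It assumes the core Encode module (not mentioned in this section) to produce the UTF-8 bytes, and it writes the sample string with \x{...} escapes so that the script itself stays pure ASCII and needs no use utf8. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);   # assumption: Encode is used here only to obtain UTF-8 bytes

# Three characters: e-acute (U+00E9), euro sign (U+20AC), and a plain "a".
# Escapes keep this source file ASCII, so "use utf8" is not required.
my $text = "\x{E9}\x{20AC}a";

# length() counts characters, however many bytes each one needs.
my $char_count = length $text;

# Encoding to UTF-8 yields a byte string; length() on that counts the bytes.
my $utf8_bytes = encode('UTF-8', $text);
my $byte_count = length $utf8_bytes;

# Prints: characters: 3, UTF-8 bytes: 6
printf "characters: %d, UTF-8 bytes: %d\n", $char_count, $byte_count;
```

In UTF-8 the e-acute takes two bytes, the euro sign three, and the ASCII letter one, which is why the encoded form is twice as long as the character count.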