=head1 NAME

perlpacktut - tutorial on C<pack> and C<unpack>

=head1 DESCRIPTION

C<pack> and C<unpack> are two functions for transforming data according
to a user-defined template, between the guarded way Perl stores values
and some well-defined representation as might be required in the
environment of a Perl program. Unfortunately, they're also two of
the most misunderstood and most often overlooked functions that Perl
provides. This tutorial will demystify them for you.


=head1 The Basic Principle

Most programming languages don't shelter the memory where variables are
stored. In C, for instance, you can take the address of some variable,
and the C<sizeof> operator tells you how many bytes are allocated to
the variable. Using the address and the size, you may access the storage
to your heart's content.

In Perl, you just can't access memory at random, but the structural and
representational conversion provided by C<pack> and C<unpack> is an
excellent alternative. The C<pack> function converts values to a byte
sequence containing representations according to a given specification,
the so-called "template" argument. C<unpack> is the reverse process,
deriving some values from the contents of a string of bytes. (Be cautioned,
however, that not all that has been packed together can be neatly unpacked -
a very common experience as seasoned travellers are likely to confirm.)

Why, you may ask, would you need a chunk of memory containing some values
in binary representation? One good reason is input and output accessing
some file, a device, or a network connection, whereby this binary
representation is either forced on you or will give you some benefit
in processing. Another cause is passing data to some system call that
is not available as a Perl function: C<syscall> requires you to provide
parameters stored in the way it happens in a C program. Even text processing
(as shown in the next section) may be simplified with judicious usage
of these two functions.

To see how (un)packing works, we'll start with a simple template
code where the conversion is in low gear: between the contents of a byte
sequence and a string of hexadecimal digits. Let's use C<unpack>, since
this is likely to remind you of a dump program, or some desperate last
message unfortunate programs are wont to throw at you before they expire
into the wild blue yonder. Assuming that the variable C<$mem> holds a
sequence of bytes that we'd like to inspect without assuming anything
about its meaning, we can write

    my( $hex ) = unpack( 'H*', $mem );
    print "$hex\n";

whereupon we might see something like this, with each pair of hex digits
corresponding to a byte:

    41204d414e204120504c414e20412043414e414c2050414e414d41

What was in this chunk of memory? Numbers, characters, or a mixture of
both? Assuming that we're on a computer where ASCII (or some similar)
encoding is used: hexadecimal values in the range C<0x41> - C<0x5A>
indicate an uppercase letter, and C<0x20> encodes a space. So we might
assume it is a piece of text, which some are able to read like a tabloid;
but others will have to get hold of an ASCII table and relive that
first-grader feeling. Not caring too much about which way to read this,
we note that C<unpack> with the template code C<H> converts the contents
of a sequence of bytes into the customary hexadecimal notation. Since
"a sequence of" is a pretty vague indication of quantity, C<H> has been
defined to convert just a single hexadecimal digit unless it is followed
by a repeat count. An asterisk for the repeat count means to use whatever
remains.

The inverse operation - packing byte contents from a string of hexadecimal
digits - is just as easily written. For instance:

    my $s = pack( 'H2' x 10, map { "3$_" } ( 0..9 ) );
    print "$s\n";

Since we feed a list of ten 2-digit hexadecimal strings to C<pack>, the
pack template should contain ten pack codes. If this is run on a computer
with ASCII character coding, it will print C<0123456789>.


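As a quick check of both directions, here is a minimal sketch that dumps a
byte string with C<H*> and packs the digits back again. The sample text in
C<$mem> is an assumption, chosen so that the dump matches the one shown
earlier; any byte string would do.

```perl
use strict;
use warnings;

# Sample bytes chosen for illustration; any byte string would do.
my $mem = "A MAN A PLAN A CANAL PANAMA";

# H* turns every byte into two hex digits, high nybble first.
my ($hex) = unpack( 'H*', $mem );

# Packing with the same template reverses the conversion.
my $back = pack( 'H*', $hex );

# Without a repeat count, H converts a single hex digit only.
my ($first_digit) = unpack( 'H', $mem );

print "$hex\n$back\n$first_digit\n";
```

The last line of output is a lone C<4>, the high nybble of the first byte,
which shows why the repeat count matters.
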
=head1 Packing Text

Let's suppose you've got to read in a data file like this:

 Date      |Description                | Income|Expenditure
 01/24/2001 Ahmed's Camel Emporium                  1147.99
 01/28/2001 Flea spray                                24.99
 01/29/2001 Camel rides to tourists     1235.00

How do we do it? You might think first to use C<split>; however, since
C<split> collapses blank fields, you'll never know whether a record was
income or expenditure. Oops. Well, you could always use C<substr>:

    while (<>) {
        my $date   = substr($_,  0, 11);
        my $desc   = substr($_, 12, 27);
        my $income = substr($_, 40,  7);
        my $expend = substr($_, 52,  7);
        ...
    }

It's not really a barrel of laughs, is it? In fact, it's worse than it
may seem; the eagle-eyed may notice that the first field should only be
10 characters wide, and the error has propagated right through the other
numbers - which we've had to count by hand. So it's error-prone as well
as horribly unfriendly.

Or maybe we could use regular expressions:

    while (<>) {
        my($date, $desc, $income, $expend) =
            m|(\d\d/\d\d/\d{4}) (.{27}) (.{7})(.*)|;
        ...
    }

Urgh. Well, it's a bit better, but - well, would you want to maintain
that?

Hey, isn't Perl supposed to make this sort of thing easy? Well, it does,
if you use the right tools. C<pack> and C<unpack> are designed to help
you out when dealing with fixed-width data like the above. Let's have a
look at a solution with C<unpack>:

    while (<>) {
        my($date, $desc, $income, $expend) = unpack("A10xA27xA7xA*", $_);
        ...
    }

That looks a bit nicer; but we've got to take apart that weird template.
Where did I pull that out of?

OK, let's have a look at some of our data again; in fact, we'll include
the headers, and a handy ruler so we can keep track of where we are.

          1         2         3         4         5
 1234567890123456789012345678901234567890123456789012345678
 Date      |Description                | Income|Expenditure
 01/28/2001 Flea spray                                24.99
 01/29/2001 Camel rides to tourists     1235.00

From this, we can see that the date column stretches from column 1 to
column 10 - ten characters wide. The C<pack>-ese for "character" is
C<A>, and ten of them are C<A10>. So if we just wanted to extract the
dates, we could say this:

    my($date) = unpack("A10", $_);

OK, what's next? Between the date and the description is a blank column;
we want to skip over that. The C<x> template means "skip forward", so we
want one of those. Next, we have another batch of characters, from 12 to
38. That's 27 more characters, hence C<A27>. (Don't make the fencepost
error - there are 27 characters between 12 and 38, not 26. Count 'em!)

Now we skip another character and pick up the next 7 characters:

    my($date,$description,$income) = unpack("A10xA27xA7", $_);

Now comes the clever bit. Lines in our ledger which are just income and
not expenditure might end at column 46. Hence, we don't want to tell our
C<unpack> pattern that we B<need> to find another 12 characters; we'll
just say "if there's anything left, take it". As you might guess from
regular expressions, that's what the C<*> means: "use everything
remaining".

=over 3

=item *

Be warned, though, that unlike regular expressions, if the C<unpack>
template doesn't match the incoming data, Perl will scream and die.

=back


Hence, putting it all together:

    my($date,$description,$income,$expend) = unpack("A10xA27xA7xA*", $_);

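To see the finished template at work, here is a small self-contained sketch.
The two ledger rows are built with C<sprintf> purely for illustration, and
padding short lines to 47 characters is an assumption added here so that the
last C<x> always has a byte to skip, even when a line has lost its newline.

```perl
use strict;
use warnings;

# Sample rows laid out in the ledger's fixed-width format (made up here).
my @lines = (
    sprintf( "%-10s %-27s %7s %11s", '01/28/2001', 'Flea spray', '', '24.99' ),
    sprintf( "%-10s %-27s %7s", '01/29/2001', 'Camel rides to tourists', '1235.00' ),
);

my @records;
for my $line (@lines) {
    # Pad short lines so the final "x" always has a byte to skip.
    my $padded = sprintf( "%-47s", $line );
    my ( $date, $desc, $income, $expend ) =
        unpack( "A10xA27xA7xA*", $padded );
    $expend = '' unless defined $expend;
    $expend =~ s/^\s+//;    # A strips trailing blanks only, not leading ones
    push @records, [ $date, $desc, $income, $expend ];
}

printf "%s | %s | %s | %s\n", @$_ for @records;
```

An empty income or expenditure field comes back as an empty string, which
is how the two kinds of record can be told apart.
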
Now, that's our data parsed. I suppose what we might want to do now is
total up our income and expenditure, and add another line to the end of
our ledger - in the same format - saying how much we've brought in and
how much we've spent:

    use POSIX ();    # loads POSIX::strftime

    while (<>) {
        my($date, $desc, $income, $expend) = unpack("A10xA27xA7xA*", $_);
        $tot_income += $income;
        $tot_expend += $expend;
    }

    $tot_income = sprintf("%.2f", $tot_income); # Get them into
    $tot_expend = sprintf("%.2f", $tot_expend); # "financial" format

    $date = POSIX::strftime("%m/%d/%Y", localtime);

    # OK, let's go:

    print pack("A10xA27xA7xA*", $date, "Totals", $tot_income, $tot_expend);

Oh, hmm. That didn't quite work. Let's see what happened:

 01/24/2001 Ahmed's Camel Emporium                  1147.99
 01/28/2001 Flea spray                                24.99
 01/29/2001 Camel rides to tourists     1235.00
 03/23/2001Totals                     1235.001172.98

OK, it's a start, but what happened to the spaces? We put C<x>, didn't
we? Shouldn't it skip forward? Let's look at what L<perlfunc/pack> says:

    x   A null byte.

Urgh. No wonder. There's a big difference between "a null byte",
character zero, and "a space", character 32. Perl's put something
between the date and the description - but unfortunately, we can't see
it!

What we actually need to do is expand the width of the fields. The C<A>
format pads any non-existent characters with spaces, so we can use the
additional spaces to line up our fields, like this:

    print pack("A11 A28 A8 A*", $date, "Totals", $tot_income, $tot_expend);

(Note that you can put spaces in the template to make it more readable,
but they don't translate to spaces in the output.) Here's what we got
this time:

 01/24/2001 Ahmed's Camel Emporium                  1147.99
 01/28/2001 Flea spray                                24.99
 01/29/2001 Camel rides to tourists     1235.00
 03/23/2001 Totals                      1235.00 1172.98

That's a bit better, but we still have that last column which needs to
be moved further over. There's an easy way to fix this up:
unfortunately, we can't get C<pack> to right-justify our fields, but we
can get C<sprintf> to do it. The expenditure column is 11 characters
wide (columns 48 to 58), so that is the field width we ask for:

    $tot_income = sprintf("%.2f", $tot_income);
    $tot_expend = sprintf("%11.2f", $tot_expend);
    $date = POSIX::strftime("%m/%d/%Y", localtime);
    print pack("A11 A28 A8 A*", $date, "Totals", $tot_income, $tot_expend);

This time we get the right answer:

 01/28/2001 Flea spray                                24.99
 01/29/2001 Camel rides to tourists     1235.00
 03/23/2001 Totals                      1235.00     1172.98

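The whole totals line can be checked in isolation with the sketch below. The
date and the amounts are hard-coded here (an assumption made only to keep the
output reproducible), and the width of 11 for the expenditure field assumes
that the column ends at column 58, as on the ruler.

```perl
use strict;
use warnings;

# Date and totals are hard-coded so that the output is reproducible.
my $date       = '03/23/2001';
my $tot_income = sprintf( "%.2f",   1235.00 );
my $tot_expend = sprintf( "%11.2f", 1147.99 + 24.99 );

# A11, A28 and A8 each absorb the blank column that follows the field.
my $totals = pack( "A11 A28 A8 A*", $date, "Totals", $tot_income, $tot_expend );

# For contrast: "x" emits a null byte, not a space.
my $with_nuls = pack( "A10xA27xA7xA*", $date, "Totals", $tot_income, "1172.98" );
my $nul_count = ( $with_nuls =~ tr/\0// );

print "$totals\n";
printf "length %d, null bytes in the x version: %d\n",
    length($totals), $nul_count;
```

The packed line is 58 characters long, the same width as the ruler, while the
C<x> variant carries three invisible null bytes.
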
So that's how we consume and produce fixed-width data. Let's recap what
we've seen of C<pack> and C<unpack> so far:

=over 3

=item *

Use C<pack> to go from several pieces of data to one fixed-width
version; use C<unpack> to turn a fixed-width-format string into several
pieces of data.

=item *

The pack format C<A> means "any character"; if you're C<pack>ing and
you've run out of things to pack, C<pack> will fill the rest up with
spaces.

=item *

C<x> means "skip a byte" when C<unpack>ing; when C<pack>ing, it means
"introduce a null byte" - that's probably not what you mean if you're
dealing with plain text.

=item *

You can follow the formats with numbers to say how many characters
should be affected by that format: C<A12> means "take 12 characters";
C<x6> means "skip 6 bytes" or "character 0, 6 times".

=item *

Instead of a number, you can use C<*> to mean "consume everything else
left".

B<Warning>: when packing multiple pieces of data, C<*> only means
"consume all of the current piece of data". That's to say

    pack("A*A*", $one, $two)

packs all of C<$one> into the first C<A*> and then all of C<$two> into
the second. This is a general principle: each format character
corresponds to one piece of data to be C<pack>ed.

=back



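The points of the recap can be tried out in a few lines. This is only a
sketch; the strings C<camel> and C<ride> are made-up sample data.

```perl
use strict;
use warnings;

my ( $one, $two ) = ( 'camel', 'ride' );    # made-up sample data

my $joined = pack( "A*A*", $one, $two );    # each A* takes one whole item
my $padded = pack( "A8",   $one );          # short item: filled up with spaces
my $cut    = pack( "A3",   $one );          # long item: cut down to 3 characters
my $nuls   = pack( "x3" );                  # three null bytes, no item consumed

# When unpacking, x3 skips three bytes and A* takes whatever is left.
my ( $left, $right ) = unpack( "A5 x3 A*", "camel...ride" );

print "[$joined] [$padded] [$cut] [$left] [$right]\n";
```
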
=head1 Packing Numbers

So much for textual data. Let's get onto the meaty stuff that C<pack>
and C<unpack> are best at: handling binary formats for numbers. There is,
of course, not just one binary format - life would be too simple - but
Perl will do all the finicky labor for you.


=head2 Integers

Packing and unpacking numbers implies conversion to and from some
I<specific> binary representation. Leaving floating point numbers
aside for the moment, the salient properties of any such representation
are:

=over 4

=item *

the number of bytes used for storing the integer,

=item *

whether the contents are interpreted as a signed or unsigned number,

=item *

the byte ordering: whether the first byte is the least or most
significant byte (or: little-endian or big-endian, respectively).

=back

So, for instance, to pack 20302 to a signed 16 bit integer in your
computer's representation you write

    my $ps = pack( 's', 20302 );

Again, the result is a string, now containing 2 bytes. If you print
this string (which is, generally, not recommended) you might see
C<ON> or C<NO> (depending on your system's byte ordering) - or something
entirely different if your computer doesn't use ASCII character encoding.
Unpacking C<$ps> with the same template returns the original integer value:

    my( $s ) = unpack( 's', $ps );

This is true for all numeric template codes. But don't expect miracles:
if the packed value exceeds the allotted byte capacity, high order bits
are silently discarded, and unpack certainly won't be able to pull them
back out of some magic hat. And, when you pack using a signed template
code such as C<s>, an excess value may result in the sign bit
getting set, and unpacking this will smartly return a negative value.

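The effects described above can be made visible with a few sample values.
In this sketch the C<E<lt>> and C<E<gt>> byte-order modifiers (available in
perl 5.10 and later) are an addition not used elsewhere in this section; they
merely pin down the byte order so that both spellings can be shown on any
machine. The values 40000 and 70000 are arbitrary examples of numbers that
do not fit.

```perl
use strict;
use warnings;

my $ps  = pack( 's', 20302 );        # 2 bytes in native byte order
my ($s) = unpack( 's', $ps );        # 20302 again

# 20302 is 0x4F4E; in ASCII, 0x4F is "O" and 0x4E is "N".
my $le = pack( 's<', 20302 );        # forced little-endian: "NO"
my $be = pack( 's>', 20302 );        # forced big-endian:    "ON"

my ( $wrapped, $truncated );
{
    no warnings 'pack';    # in case this perl warns about the overflow
    ($wrapped)   = unpack( 's', pack( 's', 40000 ) );    # sign bit gets set
    ($truncated) = unpack( 'S', pack( 'S', 70000 ) );    # bit 16 is dropped
}

print "$s $le $be $wrapped $truncated\n";
```

Here 40000 comes back as 40000 - 65536, and 70000 as 70000 - 65536.
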
16 bits won't get you too far with integers, but there is C<l> and C<L>
for signed and unsigned 32-bit integers. And if this is not enough and
your system supports 64 bit integers you can push the limits much closer
to infinity with pack codes C<q> and C<Q>. A notable exception is provided
by pack codes C<i> and C<I> for signed and unsigned integers of the
"local custom" variety: Such an integer will take up as many bytes as
a local C compiler returns for C<sizeof(int)>, but it'll use I<at least>
32 bits.

Each of the integer pack codes C<sSlLqQ> results in a fixed number of bytes,
no matter where you execute your program. This may be useful for some
applications, but it does not provide for a portable way to pass data
structures between Perl and C programs (bound to happen when you call
XS extensions or the Perl function C<syscall>), or when you read or
write binary files. What you'll need in this case are template codes that
depend on what your local C compiler compiles when you code C<short> or
C<unsigned long>, for instance. These codes and their corresponding
byte lengths are shown in the table below. Since the C standard leaves
much leeway with respect to the relative sizes of these data types, actual
values may vary, and that's why the values are given as expressions in
C and Perl. (If you'd like to use values from C<%Config> in your program
you have to import it with C<use Config>.)

   signed unsigned  byte length in C   byte length in Perl
     s!     S!      sizeof(short)      $Config{shortsize}
     i!     I!      sizeof(int)        $Config{intsize}
     l!     L!      sizeof(long)       $Config{longsize}
     q!     Q!      sizeof(long long)  $Config{longlongsize}

The C<i!> and C<I!> codes aren't different from C<i> and C<I>; they are
tolerated for completeness' sake.


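One way to convince yourself of the table is to pack a zero with each native
code and compare the length of the result with the C<%Config> entry. The
sketch below does that for the first three rows; C<q!> is left out on purpose,
since it is only accepted when perl was built with 64 bit integer support.
The hash names are hypothetical.

```perl
use strict;
use warnings;
use Config;

# Native codes and the %Config entry that should match each of them.
my %config_key = ( 's!' => 'shortsize', 'i!' => 'intsize', 'l!' => 'longsize' );

my %packed_size;
for my $code ( sort keys %config_key ) {
    $packed_size{$code} = length( pack( $code, 0 ) );
    printf "%-2s packs into %d byte(s), \$Config{%s} is %d\n",
        $code, $packed_size{$code}, $config_key{$code},
        $Config{ $config_key{$code} };
}

# The plain codes have the same width everywhere.
my $s_size = length( pack( 's', 0 ) );
my $l_size = length( pack( 'l', 0 ) );
print "s is always $s_size bytes, l is always $l_size bytes\n";
```
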
=head2 Unpacking a Stack Frame

Requesting a particular byte ordering may be necessary when you work with
binary data coming from some specific architecture whereas your program could
run on a totally different system. As an example, assume you have 24 bytes
containing a stack frame as it happens on an Intel 8086:

      +---------+        +----+----+               +---------+
 TOS: |   IP    |  TOS+4:| FL | FH | FLAGS  TOS+14:|   SI    |
      +---------+        +----+----+               +---------+
      |   CS    |        | AL | AH | AX            |   DI    |
      +---------+        +----+----+               +---------+
                         | BL | BH | BX            |   BP    |
                         +----+----+               +---------+
                         | CL | CH | CX            |   DS    |
                         +----+----+               +---------+
                         | DL | DH | DX            |   ES    |
                         +----+----+               +---------+

First, we note that this time-honored 16-bit CPU uses little-endian order,
and that's why the low order byte is stored at the lower address. To
unpack such an (unsigned) short we'll have to use code C<v>. A repeat
count unpacks all 12 shorts:

    my( $ip, $cs, $flags, $ax, $bx, $cx, $dx, $si, $di, $bp, $ds, $es ) =
      unpack( 'v12', $frame );

Alternatively, we could have used C<C> to unpack the individually
accessible byte registers FL, FH, AL, AH, etc.:

    my( $fl, $fh, $al, $ah, $bl, $bh, $cl, $ch, $dl, $dh ) =
      unpack( 'C10', substr( $frame, 4, 10 ) );

It would be nice if we could do this in one fell swoop: unpack a short,
back up a little, and then unpack 2 bytes. Since Perl I<is> nice, it
proffers the template code C<X> to back up one byte. Putting this all
together, we may now write:

    my( $ip, $cs,
        $flags,$fl,$fh,
        $ax,$al,$ah, $bx,$bl,$bh, $cx,$cl,$ch, $dx,$dl,$dh,
        $si, $di, $bp, $ds, $es ) =
    unpack( 'v2' . ('vXXCC' x 5) . 'v5', $frame );

(The clumsy construction of the template can be avoided - just read on!)

We've taken some pains to construct the template so that it matches
the contents of our frame buffer. Otherwise we'd either get undefined values,
or C<unpack> could not unpack all. If C<pack> runs out of items, it will
supply null strings (which are coerced into zeroes whenever the pack code
says so).


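To watch the C<vXXCC> trick at work you need a frame to feed it. The sketch
below fabricates one with C<pack>; the register values are invented for
illustration and do not come from a real 8086.

```perl
use strict;
use warnings;

# A made-up 24-byte frame: twelve little-endian 16-bit values
# in the order IP CS FLAGS AX BX CX DX SI DI BP DS ES.
my @regs = ( 0x0100, 0x1234, 0x0246, 0x4C01, 0x0002, 0x0003,
             0x0004, 0x0005, 0x0006, 0x0007, 0x0008, 0x0009 );
my $frame = pack( 'v12', @regs );

# Each "vXXCC" reads a short, backs up two bytes, and reads them singly.
my( $ip, $cs,
    $flags,$fl,$fh,
    $ax,$al,$ah, $bx,$bl,$bh, $cx,$cl,$ch, $dx,$dl,$dh,
    $si, $di, $bp, $ds, $es ) =
  unpack( 'v2' . ('vXXCC' x 5) . 'v5', $frame );

printf "AX=%04X AL=%02X AH=%02X\n", $ax, $al, $ah;
```

Since the frame is little-endian, the low byte AL comes first in memory, so
it is the first of the two C<C> values.
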
=head2 How to Eat an Egg on a Net

The pack code for big-endian (high order byte at the lowest address) is
C<n> for 16 bit and C<N> for 32 bit integers. You use these codes
if you know that your data comes from a compliant architecture, but,
surprisingly enough, you should also use these pack codes if you
exchange binary data, across the network, with some system that you
know next to nothing about. The simple reason is that this
order has been chosen as the I<network order>, and all standard-fearing
programs ought to follow this convention. (This is, of course, a stern
backing for one of the Lilliputian parties and may well influence the
political development there.) So, if the protocol expects you to send
a message by sending the length first, followed by just so many bytes,
you could write:

    my $buf = pack( 'N', length( $msg ) ) . $msg;

or even:

    my $buf = pack( 'NA*', length( $msg ), $msg );

and pass C<$buf> to your send routine. Some protocols demand that the
count should include the length of the count itself: then just add 4
to the data length. (But make sure to read L<"Lengths and Widths"> before
you really code this!)



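The receiving side is not shown above, so here is a hedged sketch of one
complete round trip. The payload is a sample string assumed to consist of
plain bytes; decoding with C<a*> instead of C<A*> is a choice made here
because C<A> would strip trailing blanks from the payload.

```perl
use strict;
use warnings;

my $msg = "Hello, camel";    # sample payload, assumed to hold bytes only

# Sender: 4-byte big-endian length, then the payload.
my $buf = pack( 'NA*', length($msg), $msg );

# Receiver: read the length first, then take exactly that many bytes.
my ($len) = unpack( 'N', $buf );
my $body  = substr( $buf, 4, $len );

# The same decoding in one template.
my ( $len2, $body2 ) = unpack( 'N a*', $buf );

printf "%s %d %s\n", unpack( 'H8', $buf ), $len, $body;
```

The hex dump of the first four bytes is C<0000000c>: the most significant
byte travels first.
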
=head2 Floating point Numbers

For packing floating point numbers you have the choice between the
pack codes C<f> and C<d> which pack into (or unpack from) single-precision or
double-precision representation as it is provided by your system. (There
is no such thing as a network representation for reals, so if you want
to send your real numbers across computer boundaries, you'd better stick
to ASCII representation, unless you're absolutely sure what's on the other
end of the line.)



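A short experiment shows what single precision costs. The sketch assumes
nothing beyond the two pack codes; the byte counts it prints are whatever
your system provides, commonly 4 and 8.

```perl
use strict;
use warnings;

my $x = 0.1;    # not exactly representable in binary floating point

my ($single) = unpack( 'f', pack( 'f', $x ) );
my ($half)   = unpack( 'f', pack( 'f', 0.5 ) );    # 0.5 is exact even in f

printf "f uses %d bytes, d uses %d bytes\n",
    length( pack( 'f', 0 ) ), length( pack( 'd', 0 ) );
printf "0.1 through f: %.12f\n", $single;
printf "error: %.3g\n", abs( $single - $x );
```

Squeezing 0.1 through C<f> changes it slightly, whereas a value such as 0.5,
which has an exact binary representation, survives unharmed.
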
=head1 Exotic Templates


=head2 Bit Strings

Bits are the atoms in the memory world. Access to individual bits may
have to be used either as a last resort or because it is the most
convenient way to handle your data. Bit string (un)packing converts
between strings containing a series of C<0> and C<1> characters and
a sequence of bytes each containing a group of 8 bits. This is almost
as simple as it sounds, except that there are two ways the contents of
a byte may be written as a bit string. Let's have a look at an annotated
byte:

   7 6 5 4 3 2 1 0
 +-----------------+
 | 1 0 0 0 1 1 0 0 |
 +-----------------+
  MSB           LSB

It's egg-eating all over again: Some think that as a bit string this should
be written "10001100" i.e. beginning with the most significant bit, others
insist on "00110001". Well, Perl isn't biased, so that's why we have two bit
string codes:

    $byte = pack( 'B8', '10001100' ); # start with MSB
    $byte = pack( 'b8', '00110001' ); # start with LSB

It is not possible to pack or unpack bit fields - just integral bytes.
C<pack> always starts at the next byte boundary and "rounds up" to the
next multiple of 8 by adding zero bits as required. (If you do want bit
fields, there is L<perlfunc/vec>. Or you could implement bit field
handling at the character string level, using split, substr, and
concatenation on unpacked bit strings.)

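Both claims, that the two codes describe the same byte and that C<pack>
rounds up to a full byte, are easy to verify. A minimal sketch, using the
annotated byte from above and an arbitrary 3-bit string:

```perl
use strict;
use warnings;

my $msb_first = pack( 'B8', '10001100' );    # start with MSB
my $lsb_first = pack( 'b8', '00110001' );    # start with LSB

# Only 3 bits are supplied; pack fills the byte up with zero bits.
my $short = pack( 'B3', '101' );

printf "%02X %02X %02X %s\n",
    ord($msb_first), ord($lsb_first), ord($short), unpack( 'B8', $short );
```

Both spellings produce the byte 0x8C, and the 3-bit string becomes 0xA0,
that is C<10100000>.
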
To illustrate unpacking for bit strings, we'll decompose a simple
status register (a "-" stands for a "reserved" bit):

   +-----------------+-----------------+
   | S Z - A - P - C | - - - - O D I T |
   +-----------------+-----------------+
    MSB           LSB MSB           LSB

Converting these two bytes to a string can be done with the unpack
template C<'b16'>. To obtain the individual bit values from the bit
string we use C<split> with the "empty" separator pattern which dissects
into individual characters. Bit values from the "reserved" positions are
simply assigned to C<undef>, a convenient notation for "I don't care where
this goes".

    ($carry, undef, $parity, undef, $auxcarry, undef, $zero, $sign,
     $trace, $interrupt, $direction, $overflow) =
      split( //, unpack( 'b16', $status ) );

We could have used an unpack template C<'b12'> just as well, since the
last 4 bits can be ignored anyway.


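For a test drive, a register value can be fabricated with C<B8>, which lets
us write each byte exactly as it appears in the diagram. The contents chosen
here (Z and C set in the first byte, I set in the second) are made up for
illustration.

```perl
use strict;
use warnings;

# Made-up register contents, written MSB first as in the diagram:
#                            SZ-A-P-C    ----ODIT
my $status = pack( 'B8 B8', '01000001', '00000010' );

# b16 yields each byte LSB first, so the carry flag leads the list.
my ( $carry, undef, $parity, undef, $auxcarry, undef, $zero, $sign,
     $trace, $interrupt, $direction, $overflow ) =
    split( //, unpack( 'b16', $status ) );

print "carry=$carry zero=$zero sign=$sign ",
      "interrupt=$interrupt overflow=$overflow\n";
```
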
=head2 Uuencoding

Another odd-man-out in the template alphabet is C<u>, which packs a
"uuencoded string". ("uu" is short for Unix-to-Unix.) Chances are that
you won't ever need this encoding technique which was invented to overcome
the shortcomings of old-fashioned transmission media that do not support
other than simple ASCII data. The essential recipe is simple: Take three
bytes, or 24 bits. Split them into 4 six-packs, adding a space (0x20) to
each. Repeat until all of the data is blended. Fold groups of 4 bytes into
lines no longer than 60 and garnish them in front with the original byte count
(incremented by 0x20) and a C<"\n"> at the end. - The C<pack> chef will
prepare this for you, a la minute, when you select pack code C<u> on the menu:

    my $uubuf = pack( 'u', $bindat );

A repeat count after C<u> sets the number of bytes to put into a
uuencoded line, which is the maximum of 45 by default, but could be
set to some (smaller) integer multiple of three. C<unpack> simply ignores
the repeat count.


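The recipe is easier to digest with a sample. In the sketch below the
three-byte string C<Cat> and the 90 bytes of filler are arbitrary test data.

```perl
use strict;
use warnings;

my $uubuf = pack( 'u', 'Cat' );       # "#0V%T\n": "#" is chr(3 + 0x20)
my $back  = unpack( 'u', $uubuf );    # "Cat" again

# 90 bytes of sample data: two full lines of 45 bytes by default ...
my @default = split /\n/, pack( 'u', 'x' x 90 );
# ... or three lines when the repeat count asks for 30 bytes per line.
my @short   = split /\n/, pack( 'u30', 'x' x 90 );

printf "%s%d lines of %d, %d lines of %d\n",
    $uubuf, scalar(@default), length( $default[0] ),
    scalar(@short), length( $short[0] );
```

A full default line is 61 characters long: the count character C<M>, which
is chr(45 + 0x20), followed by 60 encoded characters.
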
=head2 Doing Sums

An even stranger template code is C<%>E<lt>I<number>E<gt>. First, because
it's used as a prefix to some other template code. Second, because it
cannot be used in C<pack> at all, and third, because in C<unpack> it doesn't
return the data as defined by the template code it precedes. Instead it'll
give you an integer of I<number> bits that is computed from the data value by
doing sums. For numeric unpack codes, no big feat is achieved:

    my $buf = pack( 'iii', 100, 20, 3 );
    print unpack( '%32i3', $buf ), "\n";  # prints 123

For string values, C<%> returns the sum of the byte values saving
you the trouble of a sum loop with C<substr> and C<ord>:

    print unpack( '%32A*', "\x01\x10" ), "\n";  # prints 17

Although the C<%> code is documented as returning a "checksum":
don't put your trust in such values! Even when applied to a small number
of bytes, they won't guarantee a noticeable Hamming distance.

In connection with C<b> or C<B>, C<%> simply adds bits, and this can be put
to good use to count set bits efficiently:

    my $bitcount = unpack( '%32b*', $mask );

And an even parity bit can be determined like this:

    my $evenparity = unpack( '%1b*', $mask );


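To see that C<%> really does nothing more than add, the sketch below
compares it with the explicit loop it replaces. The byte strings are sample
data, and C<C*> is used for the byte sum here so that no trailing blanks
could be stripped from the data.

```perl
use strict;
use warnings;

my $data = "\x01\x10\xff";    # sample bytes: 1 + 16 + 255 = 272

my $sum_unpack = unpack( '%32C*', $data );

# The explicit loop that the % prefix saves you from writing.
my $sum_loop = 0;
$sum_loop += ord($_) for split //, $data;

my $sum8 = unpack( '%8C*', $data );    # only the low 8 bits: 272 % 256

my $mask     = "\xF0\x01";             # five bits set
my $bitcount = unpack( '%32b*', $mask );
my $parity   = unpack( '%1b*',  $mask );    # 1, since the bit count is odd

print "$sum_unpack $sum_loop $sum8 $bitcount $parity\n";
```

The 8-bit variant shows why such a sum makes a weak checksum: everything
above the requested number of bits is simply thrown away.
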
| 585 | =head2 Unicode
|
---|
| 586 |
|
---|
| 587 | Unicode is a character set that can represent most characters in most of
|
---|
| 588 | the world's languages, providing room for over one million different
|
---|
| 589 | characters. Unicode 3.1 specifies 94,140 characters: The Basic Latin
|
---|
| 590 | characters are assigned to the numbers 0 - 127. The Latin-1 Supplement with
|
---|
| 591 | characters that are used in several European languages is in the next
|
---|
| 592 | range, up to 255. After some more Latin extensions we find the character
|
---|
| 593 | sets from languages using non-Roman alphabets, interspersed with a
|
---|
| 594 | variety of symbol sets such as currency symbols, Zapf Dingbats or Braille.
|
---|
| 595 | (You might want to visit L<www.unicode.org> for a look at some of
|
---|
| 596 | them - my personal favourites are Telugu and Kannada.)
|
---|
| 597 |
|
---|
| 598 | The Unicode character sets associates characters with integers. Encoding
|
---|
| 599 | these numbers in an equal number of bytes would more than double the
|
---|
| 600 | requirements for storing texts written in Latin alphabets.
|
---|
| 601 | The UTF-8 encoding avoids this by storing the most common (from a western
|
---|
| 602 | point of view) characters in a single byte while encoding the rarer
|
---|
| 603 | ones in two or more bytes.
|
---|
| 604 |
|
---|
| 605 | So what has this got to do with C<pack>? Well, if you want to convert
|
---|
| 606 | between a Unicode number and its UTF-8 representation you can do so by
|
---|
| 607 | using template code C<U>. As an example, let's produce the UTF-8
|
---|
| 608 | representation of the Euro currency symbol (code number 0x20AC):
|
---|
| 609 |
|
---|
| 610 | $UTF8{Euro} = pack( 'U', 0x20AC );
|
---|
| 611 |
|
---|
| 612 | Inspecting C<$UTF8{Euro}> shows that it contains 3 bytes: "\xe2\x82\xac". The
|
---|
| 613 | round trip can be completed with C<unpack>:
|
---|
| 614 |
|
---|
| 615 | $Unicode{Euro} = unpack( 'U', $UTF8{Euro} );
|
---|
| 616 |
|
---|
| 617 | Usually you'll want to pack or unpack UTF-8 strings:
|
---|
| 618 |
|
---|
| 619 | # pack and unpack the Hebrew alphabet
|
---|
| 620 | my $alefbet = pack( 'U*', 0x05d0..0x05ea );
|
---|
| 621 | my @hebrew = unpack( 'U*', $alefbet );
|
---|
| 622 |
|
---|
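To see the difference between the character and its UTF-8 bytes, here is a minimal sketch. It assumes Perl 5.10 or later, where a template starting with C<U> produces a character string. The variable names are illustrative only, and C<utf8::encode> is used here merely to expose the encoded bytes.

```perl
use strict;
use warnings;

my $euro  = pack( 'U', 0x20AC );      # the Euro sign, U+20AC
my $chars = length( $euro );          # length counts characters

my $bytes = $euro;                    # work on a copy
utf8::encode( $bytes );               # convert the copy to UTF-8 octets in place
my $hex   = unpack( 'H*', $bytes );   # hex dump of the encoded form

my $code  = unpack( 'U', $euro );     # back to the code number

printf "%d char, bytes %s, code 0x%04X\n", $chars, $hex, $code;
# prints "1 char, bytes e282ac, code 0x20AC"
```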
| 623 |
|
---|
| 624 | =head2 Another Portable Binary Encoding
|
---|
| 625 |
|
---|
| 626 | The pack code C<w> has been added to support a portable binary data
|
---|
| 627 | encoding scheme that goes way beyond simple integers. (Details can
|
---|
| 628 | be found at L<Casbah.org>, the Scarab project.) A BER (Binary Encoded
|
---|
| 629 | Representation) compressed unsigned integer stores base 128
|
---|
| 630 | digits, most significant digit first, with as few digits as possible.
|
---|
| 631 | Bit eight (the high bit) is set on each byte except the last. There
|
---|
| 632 | is no size limit to BER encoding, but Perl won't go to extremes.
|
---|
| 633 |
|
---|
| 634 | my $berbuf = pack( 'w*', 1, 128, 128+1, 128*128+127 );
|
---|
| 635 |
|
---|
| 636 | A hex dump of C<$berbuf>, with spaces inserted at the right places,
|
---|
| 637 | shows 01 8100 8101 81807F. Since the last byte is always less than
|
---|
| 638 | 128, C<unpack> knows where to stop.
|
---|
| 639 |
|
---|
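The hex dump quoted above is easy to reproduce. The following sketch packs the same four numbers, dumps the buffer with C<H*>, and unpacks it again. The names C<@values> and C<@back> are introduced only for this illustration.

```perl
use strict;
use warnings;

my @values = ( 1, 128, 128 + 1, 128 * 128 + 127 );
my $berbuf = pack( 'w*', @values );

# 01 | 81 00 | 81 01 | 81 80 7f
my $hex  = unpack( 'H*', $berbuf );
my @back = unpack( 'w*', $berbuf );   # each item ends at a byte below 128

print "$hex\n";                       # prints "018100810181807f"
print "@back\n";                      # prints "1 128 129 16511"
```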
| 640 |
|
---|
| 641 | =head1 Template Grouping
|
---|
| 642 |
|
---|
| 643 | Prior to Perl 5.8, repetitions of templates had to be made by
|
---|
| 644 | C<x>-multiplication of template strings. Now there is a better way as
|
---|
| 645 | we may use the pack codes C<(> and C<)> combined with a repeat count.
|
---|
| 646 | The C<unpack> template from the Stack Frame example can simply
|
---|
| 647 | be written like this:
|
---|
| 648 |
|
---|
| 649 | unpack( 'v2 (vXXCC)5 v5', $frame )
|
---|
| 650 |
|
---|
| 651 | Let's explore this feature a little more. We'll begin with the equivalent of
|
---|
| 652 |
|
---|
| 653 | join( '', map( substr( $_, 0, 1 ), @str ) )
|
---|
| 654 |
|
---|
| 655 | which returns a string consisting of the first character from each string.
|
---|
| 656 | Using pack, we can write
|
---|
| 657 |
|
---|
| 658 | pack( '(A)'.@str, @str )
|
---|
| 659 |
|
---|
| 660 | or, because a repeat count C<*> means "repeat as often as required",
|
---|
| 661 | simply
|
---|
| 662 |
|
---|
| 663 | pack( '(A)*', @str )
|
---|
| 664 |
|
---|
| 665 | (Note that the template C<A*> would only have packed C<$str[0]> in full
|
---|
| 666 | length.)
|
---|
| 667 |
|
---|
| 668 | To pack dates stored as triplets ( day, month, year ) in an array C<@dates>
|
---|
| 669 | into a sequence of byte, byte, short integer we can write
|
---|
| 670 |
|
---|
| 671 | $pd = pack( '(CCS)*', map( @$_, @dates ) );
|
---|
| 672 |
|
---|
| 673 | To swap pairs of characters in a string (with even length) one could use
|
---|
| 674 | several techniques. First, let's use C<x> and C<X> to skip forward and back:
|
---|
| 675 |
|
---|
| 676 | $s = pack( '(A)*', unpack( '(xAXXAx)*', $s ) );
|
---|
| 677 |
|
---|
| 678 | We can also use C<@> to jump to an offset, with 0 being the position where
|
---|
| 679 | we were when the last C<(> was encountered:
|
---|
| 680 |
|
---|
| 681 | $s = pack( '(A)*', unpack( '(@1A @0A @2)*', $s ) );
|
---|
| 682 |
|
---|
| 683 | Finally, there is also an entirely different approach by unpacking big
|
---|
| 684 | endian shorts and packing them in the reverse byte order:
|
---|
| 685 |
|
---|
| 686 | $s = pack( '(v)*', unpack( '(n)*', $s ) );
|
---|
| 687 |
|
---|
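The three swapping techniques can be tried side by side. This is just a sketch with a made-up six-character input; the variable names are not taken from the tutorial. Note that C<A> strips trailing spaces when unpacking, so the first two variants are best kept to strings without blanks.

```perl
use strict;
use warnings;

my $s = 'abcdef';    # hypothetical input of even length

# skip forward and back with x and X
my $by_skip   = pack( '(A)*', unpack( '(xAXXAx)*', $s ) );

# jump to offsets relative to the start of the group
my $by_offset = pack( '(A)*', unpack( '(@1A @0A @2)*', $s ) );

# read big-endian shorts, write them little-endian
my $by_endian = pack( '(v)*', unpack( '(n)*', $s ) );

print "$by_skip $by_offset $by_endian\n";   # prints "badcfe badcfe badcfe"
```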
| 688 |
|
---|
| 689 | =head1 Lengths and Widths
|
---|
| 690 |
|
---|
| 691 | =head2 String Lengths
|
---|
| 692 |
|
---|
| 693 | In the previous section we've seen a network message that was constructed
|
---|
| 694 | by prefixing the binary message length to the actual message. You'll find
|
---|
| 695 | that packing a length followed by so many bytes of data is a
|
---|
| 696 | frequently used recipe since appending a null byte won't work
|
---|
| 697 | if a null byte may be part of the data. Here is an example where both
|
---|
| 698 | techniques are used: after two null terminated strings with source and
|
---|
| 699 | destination address, a Short Message (to a mobile phone) is sent after
|
---|
| 700 | a length byte:
|
---|
| 701 |
|
---|
| 702 | my $msg = pack( 'Z*Z*CA*', $src, $dst, length( $sm ), $sm );
|
---|
| 703 |
|
---|
| 704 | Unpacking this message can be done with the same template:
|
---|
| 705 |
|
---|
| 706 | ( $src, $dst, $len, $sm ) = unpack( 'Z*Z*CA*', $msg );
|
---|
| 707 |
|
---|
| 708 | There's a subtle trap lurking in the offing: Adding another field after
|
---|
| 709 | the Short Message (in variable C<$sm>) is all right when packing, but this
|
---|
| 710 | cannot be unpacked naively:
|
---|
| 711 |
|
---|
| 712 | # pack a message
|
---|
| 713 | my $msg = pack( 'Z*Z*CA*C', $src, $dst, length( $sm ), $sm, $prio );
|
---|
| 714 |
|
---|
| 715 | # unpack fails - $prio remains undefined!
|
---|
| 716 | ( $src, $dst, $len, $sm, $prio ) = unpack( 'Z*Z*CA*C', $msg );
|
---|
| 717 |
|
---|
| 718 | The pack code C<A*> gobbles up all remaining bytes, and C<$prio> remains
|
---|
| 719 | undefined! Before we let disappointment dampen the morale: Perl's got
|
---|
| 720 | the trump card to make this trick too, just a little further up the sleeve.
|
---|
| 721 | Watch this:
|
---|
| 722 |
|
---|
| 723 | # pack a message: ASCIIZ, ASCIIZ, length/string, byte
|
---|
| 724 | my $msg = pack( 'Z* Z* C/A* C', $src, $dst, $sm, $prio );
|
---|
| 725 |
|
---|
| 726 | # unpack
|
---|
| 727 | ( $src, $dst, $sm, $prio ) = unpack( 'Z* Z* C/A* C', $msg );
|
---|
| 728 |
|
---|
| 729 | Combining two pack codes with a slash (C</>) associates them with a single
|
---|
| 730 | value from the argument list. In C<pack>, the length of the argument is
|
---|
| 731 | taken and packed according to the first code while the argument itself
|
---|
| 732 | is added after being converted with the template code after the slash.
|
---|
| 733 | This saves us the trouble of inserting the C<length> call, but it is
|
---|
| 734 | in C<unpack> where we really score: The value of the length byte marks the
|
---|
| 735 | end of the string to be taken from the buffer. Since this combination
|
---|
| 736 | doesn't make sense unless the second pack code is C<a*>, C<A*>
|
---|
| 737 | or C<Z*>, Perl won't let you do otherwise.
|
---|
| 738 |
|
---|
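A short round trip shows the length byte doing its job. The field values below are invented for the example; only the template comes from the text.

```perl
use strict;
use warnings;

# Hypothetical message fields.
my ( $src, $dst, $sm, $prio ) = ( '1234', '5678', 'Hello', 3 );

# ASCIIZ, ASCIIZ, length byte followed by the string, priority byte
my $msg = pack( 'Z* Z* C/A* C', $src, $dst, $sm, $prio );

# Layout: "1234\0" "5678\0" "\x05" "Hello" "\x03" (17 bytes)
my @fields = unpack( 'Z* Z* C/A* C', $msg );

print length( $msg ), ": @fields\n";    # prints "17: 1234 5678 Hello 3"
```

Because the length byte tells C<unpack> where the string ends, the trailing priority byte is recovered intact.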
| 739 | The pack code preceding C</> may be anything that's fit to represent a
|
---|
| 740 | number: All the numeric binary pack codes, and even text codes such as
|
---|
| 741 | C<A4> or C<Z*>:
|
---|
| 742 |
|
---|
| 743 | # pack/unpack a string preceded by its length in ASCII
|
---|
| 744 | my $buf = pack( 'A4/A*', "Humpty-Dumpty" );
|
---|
| 745 | # unpack $buf: '13  Humpty-Dumpty'
|
---|
| 746 | my $txt = unpack( 'A4/A*', $buf );
|
---|
| 747 |
|
---|
| 748 | C</> is not implemented in Perls before 5.6, so if your code is required to
|
---|
| 749 | work on older Perls you'll need to C<unpack( 'Z* Z* C')> to get the length,
|
---|
| 750 | then use it to make a new unpack string. For example
|
---|
| 751 |
|
---|
| 752 | # pack a message: ASCIIZ, ASCIIZ, length, string, byte (5.005 compatible)
|
---|
| 753 | my $msg = pack( 'Z* Z* C A* C', $src, $dst, length $sm, $sm, $prio );
|
---|
| 754 |
|
---|
| 755 | # unpack
|
---|
| 756 | ( undef, undef, $len) = unpack( 'Z* Z* C', $msg );
|
---|
| 757 | ($src, $dst, $sm, $prio) = unpack ( "Z* Z* x A$len C", $msg );
|
---|
| 758 |
|
---|
| 759 | But that second C<unpack> is rushing ahead. It isn't using a simple literal
|
---|
| 760 | string for the template. So maybe we should introduce...
|
---|
| 761 |
|
---|
| 762 | =head2 Dynamic Templates
|
---|
| 763 |
|
---|
| 764 | So far, we've seen literals used as templates. If the list of pack
|
---|
| 765 | items doesn't have fixed length, an expression constructing the
|
---|
| 766 | template is required (whenever, for some reason, C<()*> cannot be used).
|
---|
| 767 | Here's an example: To store named string values in a way that can be
|
---|
| 768 | conveniently parsed by a C program, we create a sequence of names and
|
---|
| 769 | null terminated ASCII strings, with C<=> between the name and the value,
|
---|
| 770 | followed by an additional delimiting null byte. Here's how:
|
---|
| 771 |
|
---|
| 772 | my $env = pack( '(A*A*Z*)' . keys( %Env ) . 'C',
|
---|
| 773 | map( { ( $_, '=', $Env{$_} ) } keys( %Env ) ), 0 );
|
---|
| 774 |
|
---|
| 775 | Let's examine the cogs of this byte mill, one by one. There's the C<map>
|
---|
| 776 | call, creating the items we intend to stuff into the C<$env> buffer:
|
---|
| 777 | to each key (in C<$_>) it adds the C<=> separator and the hash entry value.
|
---|
| 778 | Each triplet is packed with the template code sequence C<A*A*Z*> that
|
---|
| 779 | is repeated according to the number of keys. (Yes, that's what the C<keys>
|
---|
| 780 | function returns in scalar context.) To get the very last null byte,
|
---|
| 781 | we add a C<0> at the end of the C<pack> list, to be packed with C<C>.
|
---|
| 782 | (Attentive readers may have noticed that we could have omitted the 0.)
|
---|
| 783 |
|
---|
| 784 | For the reverse operation, we'll have to determine the number of items
|
---|
| 785 | in the buffer before we can let C<unpack> rip it apart:
|
---|
| 786 |
|
---|
| 787 | my $n = $env =~ tr/\0// - 1;
|
---|
| 788 | my %env = map( split( /=/, $_ ), unpack( "(Z*)$n", $env ) );
|
---|
| 789 |
|
---|
| 790 | The C<tr> counts the null bytes. The C<unpack> call returns a list of
|
---|
| 791 | name-value pairs each of which is taken apart in the C<map> block.
|
---|
| 792 |
|
---|
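Put together, the sentinel-based scheme might look like the sketch below. The contents of C<%Env> are invented, the keys are sorted only to make the buffer reproducible, and the limit of 2 passed to C<split> is an addition so that values containing C<=> survive the round trip.

```perl
use strict;
use warnings;

my %Env   = ( HOME => '/root', SHELL => '/bin/sh' );   # hypothetical data
my @names = sort keys %Env;

# name, '=', null-terminated value for each entry, then a final null byte
my $env = pack( '(A*A*Z*)' . @names . 'C',
                map( { ( $_, '=', $Env{$_} ) } @names ), 0 );

# one null byte per entry plus the final delimiter
my $n   = $env =~ tr/\0// - 1;
my %env = map( split( /=/, $_, 2 ), unpack( "(Z*)$n", $env ) );

print "$_=$env{$_}\n" for sort keys %env;
```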
| 793 |
|
---|
| 794 | =head2 Counting Repetitions
|
---|
| 795 |
|
---|
| 796 | Rather than storing a sentinel at the end of a data item (or a list of items),
|
---|
| 797 | we could precede the data with a count. Again, we pack keys and values of
|
---|
| 798 | a hash, preceding each with an unsigned short length count, and up front
|
---|
| 799 | we store the number of pairs:
|
---|
| 800 |
|
---|
| 801 | my $env = pack( 'S(S/A* S/A*)*', scalar keys( %Env ), %Env );
|
---|
| 802 |
|
---|
| 803 | This simplifies the reverse operation as the number of repetitions can be
|
---|
| 804 | unpacked with the C</> code:
|
---|
| 805 |
|
---|
| 806 | my %env = unpack( 'S/(S/A* S/A*)', $env );
|
---|
| 807 |
|
---|
| 808 | Note that this is one of the rare cases where you cannot use the same
|
---|
| 809 | template for C<pack> and C<unpack> because C<pack> can't determine
|
---|
| 810 | a repeat count for a C<()>-group.
|
---|
| 811 |
|
---|
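Here is the count-prefixed variant as a complete round trip. The hash contents are made up for the example; the two templates are the ones shown above.

```perl
use strict;
use warnings;

my %Env = ( HOME => '/root', SHELL => '/bin/sh' );   # hypothetical data

# number of pairs, then length-prefixed key and value for each pair
my $env = pack( 'S(S/A* S/A*)*', scalar keys( %Env ), %Env );

# the leading short supplies the repeat count for the group
my %env = unpack( 'S/(S/A* S/A*)', $env );

print "$_=$env{$_}\n" for sort keys %env;
```

With these values the buffer holds 2 bytes for the pair count plus 8 bytes of string lengths and 20 bytes of text, assuming a 2-byte native short.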
| 812 |
|
---|
| 813 | =head1 Packing and Unpacking C Structures
|
---|
| 814 |
|
---|
| 815 | In previous sections we have seen how to pack numbers and character
|
---|
| 816 | strings. If it were not for a couple of snags we could conclude this
|
---|
| 817 | section right away with the terse remark that C structures don't
|
---|
| 818 | contain anything else, and therefore you already know all there is to it.
|
---|
| 819 | Sorry, no: read on, please.
|
---|
| 820 |
|
---|
| 821 | =head2 The Alignment Pit
|
---|
| 822 |
|
---|
| 823 | In the consideration of speed against memory requirements the balance
|
---|
| 824 | has been tilted in favor of faster execution. This has influenced the
|
---|
| 825 | way C compilers allocate memory for structures: On architectures
|
---|
| 826 | where a 16-bit or 32-bit operand can be moved faster between places in
|
---|
| 827 | memory, or to or from a CPU register, if it is aligned at an even or
|
---|
| 828 | multiple-of-four or even at a multiple-of-eight address, a C compiler
|
---|
| 829 | will give you this speed benefit by stuffing extra bytes into structures.
|
---|
| 830 | If you don't cross the C shoreline this is not likely to cause you any
|
---|
| 831 | grief (although you should care when you design large data structures,
|
---|
| 832 | or you want your code to be portable between architectures (you do want
|
---|
| 833 | that, don't you?)).
|
---|
| 834 |
|
---|
| 835 | To see how this affects C<pack> and C<unpack>, we'll compare these two
|
---|
| 836 | C structures:
|
---|
| 837 |
|
---|
| 838 | typedef struct {
|
---|
| 839 | char c1;
|
---|
| 840 | short s;
|
---|
| 841 | char c2;
|
---|
| 842 | long l;
|
---|
| 843 | } gappy_t;
|
---|
| 844 |
|
---|
| 845 | typedef struct {
|
---|
| 846 | long l;
|
---|
| 847 | short s;
|
---|
| 848 | char c1;
|
---|
| 849 | char c2;
|
---|
| 850 | } dense_t;
|
---|
| 851 |
|
---|
| 852 | Typically, a C compiler allocates 12 bytes to a C<gappy_t> variable, but
|
---|
| 853 | requires only 8 bytes for a C<dense_t>. After investigating this further,
|
---|
| 854 | we can draw memory maps, showing where the extra 4 bytes are hidden:
|
---|
| 855 |
|
---|
| 856 |  0           +4          +8          +12
|
---|
| 857 |  +--+--+--+--+--+--+--+--+--+--+--+--+
|
---|
| 858 |  |c1|xx|  s  |c2|xx|xx|xx|     l     |    xx = fill byte
|
---|
| 859 |  +--+--+--+--+--+--+--+--+--+--+--+--+
|
---|
| 860 |  gappy_t
|
---|
| 861 |
|
---|
| 862 |  0           +4          +8
|
---|
| 863 |  +--+--+--+--+--+--+--+--+
|
---|
| 864 |  |     l     |  s  |c1|c2|
|
---|
| 865 |  +--+--+--+--+--+--+--+--+
|
---|
| 866 |  dense_t
|
---|
| 867 |
|
---|
| 868 | And that's where the first quirk strikes: C<pack> and C<unpack>
|
---|
| 869 | templates have to be stuffed with C<x> codes to get those extra fill bytes.
|
---|
| 870 |
|
---|
| 871 | The natural question: "Why can't Perl compensate for the gaps?" warrants
|
---|
| 872 | an answer. One good reason is that C compilers might provide (non-ANSI)
|
---|
| 873 | extensions permitting all sorts of fancy control over the way structures
|
---|
| 874 | are aligned, even at the level of an individual structure field. And, if
|
---|
| 875 | this were not enough, there is an insidious thing called C<union> where
|
---|
| 876 | the amount of fill bytes cannot be derived from the alignment of the next
|
---|
| 877 | item alone.
|
---|
| 878 |
|
---|
| 879 | OK, so let's bite the bullet. Here's one way to get the alignment right
|
---|
| 880 | by inserting template codes C<x>, which don't take a corresponding item
|
---|
| 881 | from the list:
|
---|
| 882 |
|
---|
| 883 | my $gappy = pack( 'cxs cxxx l!', $c1, $s, $c2, $l );
|
---|
| 884 |
|
---|
| 885 | Note the C<!> after C<l>: We want to make sure that we pack a long
|
---|
| 886 | integer as it is compiled by our C compiler. And even now, it will only
|
---|
| 887 | work for the platforms where the compiler aligns things as above.
|
---|
| 888 | And somebody somewhere has a platform where it doesn't.
|
---|
| 889 | [Probably a Cray, where C<short>s, C<int>s and C<long>s are all 8 bytes. :-)]
|
---|
| 890 |
|
---|
| 891 | Counting bytes and watching alignments in lengthy structures is bound to
|
---|
| 892 | be a drag. Isn't there a way we can create the template with a simple
|
---|
| 893 | program? Here's a C program that does the trick:
|
---|
| 894 |
|
---|
| 895 | #include <stdio.h>
|
---|
| 896 | #include <stddef.h>
|
---|
| 897 |
|
---|
| 898 | typedef struct {
|
---|
| 899 | char fc1;
|
---|
| 900 | short fs;
|
---|
| 901 | char fc2;
|
---|
| 902 | long fl;
|
---|
| 903 | } gappy_t;
|
---|
| 904 |
|
---|
| 905 | #define Pt(struct,field,tchar) \
|
---|
| 906 | printf( "@%d%s ", (int) offsetof(struct,field), # tchar );
|
---|
| 907 |
|
---|
| 908 | int main() {
|
---|
| 909 | Pt( gappy_t, fc1, c );
|
---|
| 910 | Pt( gappy_t, fs, s! );
|
---|
| 911 | Pt( gappy_t, fc2, c );
|
---|
| 912 | Pt( gappy_t, fl, l! );
|
---|
| 913 | printf( "\n" );
|
---|
| 914 | }
|
---|
| 915 |
|
---|
| 916 | The output line can be used as a template in a C<pack> or C<unpack> call:
|
---|
| 917 |
|
---|
| 918 | my $gappy = pack( '@0c @2s! @4c @8l!', $c1, $s, $c2, $l );
|
---|
| 919 |
|
---|
| 920 | Gee, yet another template code - as if we hadn't plenty. But
|
---|
| 921 | C<@> saves our day by enabling us to specify the offset from the beginning
|
---|
| 922 | of the pack buffer to the next item: This is just the value
|
---|
| 923 | the C<offsetof> macro (defined in C<E<lt>stddef.hE<gt>>) returns when
|
---|
| 924 | given a C<struct> type and one of its field names ("member-designator" in
|
---|
| 925 | C standardese).
|
---|
| 926 |
|
---|
| 927 | Neither using offsets nor adding C<x>'s to bridge the gaps is satisfactory.
|
---|
| 928 | (Just imagine what happens if the structure changes.) What we really need
|
---|
| 929 | is a way of saying "skip as many bytes as required to the next multiple of N".
|
---|
| 930 | In fluent Templatese, you say this with C<x!N> where N is replaced by the
|
---|
| 931 | appropriate value. Here's the next version of our struct packaging:
|
---|
| 932 |
|
---|
| 933 | my $gappy = pack( 'c x!2 s c x!4 l!', $c1, $s, $c2, $l );
|
---|
| 934 |
|
---|
| 935 | That's certainly better, but we still have to know how long all the
|
---|
| 936 | integers are, and portability is far away. Rather than C<2>,
|
---|
| 937 | for instance, we want to say "however long a short is". But this can be
|
---|
| 938 | done by enclosing the appropriate pack code in brackets: C<[s]>. So, here's
|
---|
| 939 | the very best we can do:
|
---|
| 940 |
|
---|
| 941 | my $gappy = pack( 'c x![s] s c x![l!] l!', $c1, $s, $c2, $l );
|
---|
| 942 |
|
---|
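To convince yourself that the alignment codes produce the same bytes as the hand-counted fill bytes, you can compare both templates. This sketch uses invented field values and assumes a platform where a C<short> is 2 bytes and a C<long> is 4 or 8 bytes; elsewhere the two buffers may differ.

```perl
use strict;
use warnings;

my @fields = ( 1, 2, 3, 4 );    # hypothetical values for c1, s, c2, l

my $by_hand  = pack( 'cxs cxxx l!',           @fields );
my $by_align = pack( 'c x![s] s c x![l!] l!', @fields );

printf "%d %d %s\n", length( $by_hand ), length( $by_align ),
       $by_hand eq $by_align ? 'same' : 'different';

# the same template takes the structure apart again
my ( $c1, $s, $c2, $l ) = unpack( 'c x![s] s c x![l!] l!', $by_align );
print "$c1 $s $c2 $l\n";        # prints "1 2 3 4"
```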
| 943 |
|
---|
| 944 | =head2 Alignment, Take 2
|
---|
| 945 |
|
---|
| 946 | I'm afraid that we're not quite through with the alignment catch yet. The
|
---|
| 947 | hydra raises another ugly head when you pack arrays of structures:
|
---|
| 948 |
|
---|
| 949 | typedef struct {
|
---|
| 950 | short count;
|
---|
| 951 | char glyph;
|
---|
| 952 | } cell_t;
|
---|
| 953 |
|
---|
| 954 | typedef cell_t buffer_t[BUFLEN];
|
---|
| 955 |
|
---|
| 956 | Where's the catch? Padding is neither required before the first field C<count>,
|
---|
| 957 | nor between this and the next field C<glyph>, so why can't we simply pack
|
---|
| 958 | like this:
|
---|
| 959 |
|
---|
| 960 | # something goes wrong here:
|
---|
| 961 | pack( 's!a' x @buffer,
|
---|
| 962 | map{ ( $_->{count}, $_->{glyph} ) } @buffer );
|
---|
| 963 |
|
---|
| 964 | This packs C<3*@buffer> bytes, but it turns out that the size of
|
---|
| 965 | C<buffer_t> is four times C<BUFLEN>! The moral of the story is that
|
---|
| 966 | the required alignment of a structure or array is propagated to the
|
---|
| 967 | next higher level where we have to consider padding I<at the end>
|
---|
| 968 | of each component as well. Thus the correct template is:
|
---|
| 969 |
|
---|
| 970 | pack( 's!ax' x @buffer,
|
---|
| 971 | map{ ( $_->{count}, $_->{glyph} ) } @buffer );
|
---|
| 972 |
|
---|
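The effect of the trailing fill byte is easy to measure. In this sketch the contents of C<@buffer> are invented, and the sizes in the comments assume a 2-byte native C<short>.

```perl
use strict;
use warnings;

# Hypothetical cells.
my @buffer = (
    { count => 1, glyph => 'a' },
    { count => 2, glyph => 'b' },
    { count => 3, glyph => 'c' },
);
my @items = map { ( $_->{count}, $_->{glyph} ) } @buffer;

my $tight  = pack( 's!a'  x @buffer, @items );   # 3 bytes per cell
my $padded = pack( 's!ax' x @buffer, @items );   # 4 bytes per cell

printf "%d %d\n", length( $tight ), length( $padded );
```

Only the padded buffer matches the layout of C<buffer_t> on a compiler that aligns C<cell_t> to an even address.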
| 973 | =head2 Alignment, Take 3
|
---|
| 974 |
|
---|
| 975 | And even if you take all the above into account, ANSI still lets this:
|
---|
| 976 |
|
---|
| 977 | typedef struct {
|
---|
| 978 | char foo[2];
|
---|
| 979 | } foo_t;
|
---|
| 980 |
|
---|
| 981 | vary in size. The alignment constraint of the structure can be greater than
|
---|
| 982 | any of its elements. [And if you think that this doesn't affect anything
|
---|
| 983 | common, dismember the next cellphone that you see. Many have ARM cores, and
|
---|
| 984 | the ARM structure rules make C<sizeof (foo_t)> == 4]
|
---|
| 985 |
|
---|
| 986 | =head2 Pointers for How to Use Them
|
---|
| 987 |
|
---|
| 988 | The title of this section indicates the second problem you may run into
|
---|
| 989 | sooner or later when you pack C structures. If the function you intend
|
---|
| 990 | to call expects a, say, C<void *> value, you I<cannot> simply take
|
---|
| 991 | a reference to a Perl variable. (Although that value certainly is a
|
---|
| 992 | memory address, it's not the address where the variable's contents are
|
---|
| 993 | stored.)
|
---|
| 994 |
|
---|
| 995 | Template code C<P> promises to pack a "pointer to a fixed length string".
|
---|
| 996 | Isn't this what we want? Let's try:
|
---|
| 997 |
|
---|
| 998 | # allocate some storage and pack a pointer to it
|
---|
| 999 | my $memory = "\x00" x $size;
|
---|
| 1000 | my $memptr = pack( 'P', $memory );
|
---|
| 1001 |
|
---|
| 1002 | But wait: doesn't C<pack> just return a sequence of bytes? How can we pass this
|
---|
| 1003 | string of bytes to some C code expecting a pointer which is, after all,
|
---|
| 1004 | nothing but a number? The answer is simple: We have to obtain the numeric
|
---|
| 1005 | address from the bytes returned by C<pack>.
|
---|
| 1006 |
|
---|
| 1007 | my $ptr = unpack( 'L!', $memptr );
|
---|
| 1008 |
|
---|
| 1009 | Obviously this assumes that it is possible to typecast a pointer
|
---|
| 1010 | to an unsigned long and vice versa, which frequently works but should not
|
---|
| 1011 | be taken as a universal law. - Now that we have this pointer the next question
|
---|
| 1012 | is: How can we put it to good use? We need a call to some C function
|
---|
| 1013 | where a pointer is expected. The read(2) system call comes to mind:
|
---|
| 1014 |
|
---|
| 1015 | ssize_t read(int fd, void *buf, size_t count);
|
---|
| 1016 |
|
---|
| 1017 | After reading L<perlfunc> explaining how to use C<syscall> we can write
|
---|
| 1018 | this Perl function copying a file to standard output:
|
---|
| 1019 |
|
---|
| 1020 | require 'syscall.ph';
|
---|
| 1021 | sub cat($){
|
---|
| 1022 | my $path = shift();
|
---|
| 1023 | my $size = -s $path;
|
---|
| 1024 | my $memory = "\x00" x $size; # allocate some memory
|
---|
| 1025 | my $ptr = unpack( 'L!', pack( 'P', $memory ) );
|
---|
| 1026 | open( F, $path ) || die( "$path: cannot open ($!)\n" );
|
---|
| 1027 | my $fd = fileno(F);
|
---|
| 1028 | my $res = syscall( &SYS_read, $fd, $ptr, $size );
|
---|
| 1029 | print $memory;
|
---|
| 1030 | close( F );
|
---|
| 1031 | }
|
---|
| 1032 |
|
---|
| 1033 | This is neither a specimen of simplicity nor a paragon of portability but
|
---|
| 1034 | it illustrates the point: We are able to sneak behind the scenes and
|
---|
| 1035 | access Perl's otherwise well-guarded memory! (Important note: Perl's
|
---|
| 1036 | C<syscall> does I<not> require you to construct pointers in this roundabout
|
---|
| 1037 | way. You simply pass a string variable, and Perl forwards the address.)
|
---|
| 1038 |
|
---|
| 1039 | How does C<unpack> with C<P> work? Imagine some pointer in the buffer
|
---|
| 1040 | about to be unpacked: If it isn't the null pointer (which will smartly
|
---|
| 1041 | produce the C<undef> value) we have a start address - but then what?
|
---|
| 1042 | Perl has no way of knowing how long this "fixed length string" is, so
|
---|
| 1043 | it's up to you to specify the actual size as an explicit length after C<P>.
|
---|
| 1044 |
|
---|
| 1045 | my $mem = "abcdefghijklmn";
|
---|
| 1046 | print unpack( 'P5', pack( 'P', $mem ) ); # prints "abcde"
|
---|
| 1047 |
|
---|
| 1048 | As a consequence, C<pack> ignores any number or C<*> after C<P>.
|
---|
| 1049 |
|
---|
| 1050 |
|
---|
| 1051 | Now that we have seen C<P> at work, we might as well give C<p> a whirl.
|
---|
| 1052 | Why do we need a second template code for packing pointers at all? The
|
---|
| 1053 | answer lies behind the simple fact that an C<unpack> with C<p> promises
|
---|
| 1054 | a null-terminated string starting at the address taken from the buffer,
|
---|
| 1055 | and that implies a length for the data item to be returned:
|
---|
| 1056 |
|
---|
| 1057 | my $buf = pack( 'p', "abc\x00efhijklmn" );
|
---|
| 1058 | print unpack( 'p', $buf ); # prints "abc"
|
---|
| 1059 |
|
---|
| 1060 |
|
---|
| 1061 |
|
---|
| 1062 | This is apt to be confusing, though: As a consequence of the length being
|
---|
| 1063 | implied by the string's length, a number after pack code C<p> is a repeat
|
---|
| 1064 | count, not a length as after C<P>.
|
---|
| 1065 |
|
---|
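The difference between the two pointer codes shows up when both are applied to the same string. This is a sketch only: C<$mem> is a made-up value kept alive in a named variable, and the C<Config> module is consulted merely to confirm the size of a packed pointer.

```perl
use strict;
use warnings;
use Config;

my $mem = "abc\x00defgh";              # hypothetical data with an embedded null

my $ptr    = pack( 'P', $mem );        # pointer to the string's storage
my $first5 = unpack( 'P5', $ptr );     # exactly 5 bytes from that address

my $cstr   = unpack( 'p', pack( 'p', $mem ) );   # stops at the null byte

printf "%d %d %s\n", length( $ptr ), length( $first5 ), $cstr;
```

The first number printed is the platform's pointer size, which is why the result cannot be quoted here.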
| 1066 |
|
---|
| 1067 | Using C<pack(..., $x)> with C<P> or C<p> to get the address where C<$x> is
|
---|
| 1068 | actually stored must be used with circumspection. Perl's internal machinery
|
---|
| 1069 | considers the relation between a variable and that address as its very own
|
---|
| 1070 | private matter and doesn't really care that we have obtained a copy. Therefore:
|
---|
| 1071 |
|
---|
| 1072 | =over 4
|
---|
| 1073 |
|
---|
| 1074 | =item *
|
---|
| 1075 |
|
---|
| 1076 | Do not use C<pack> with C<p> or C<P> to obtain the address of a variable
|
---|
| 1077 | that's bound to go out of scope (thereby freeing its memory) before you
|
---|
| 1078 | are done with using the memory at that address.
|
---|
| 1079 |
|
---|
| 1080 | =item *
|
---|
| 1081 |
|
---|
| 1082 | Be very careful with Perl operations that change the value of the
|
---|
| 1083 | variable. Appending something to the variable, for instance, might require
|
---|
| 1084 | reallocation of its storage, leaving you with a pointer into no-man's land.
|
---|
| 1085 |
|
---|
| 1086 | =item *
|
---|
| 1087 |
|
---|
| 1088 | Don't think that you can get the address of a Perl variable
|
---|
| 1089 | when it is stored as an integer or double number! C<pack('P', $x)> will
|
---|
| 1090 | force the variable's internal representation to string, just as if you
|
---|
| 1091 | had written something like C<$x .= ''>.
|
---|
| 1092 |
|
---|
| 1093 | =back
|
---|
| 1094 |
|
---|
| 1095 | It's safe, however, to P- or p-pack a string literal, because Perl simply
|
---|
| 1096 | allocates an anonymous variable.
|
---|
| 1097 |
|
---|
| 1098 |
|
---|
| 1099 |
|
---|
| 1100 | =head1 Pack Recipes
|
---|
| 1101 |
|
---|
| 1102 | Here is a collection of (possibly) useful canned recipes for C<pack>
|
---|
| 1103 | and C<unpack>:
|
---|
| 1104 |
|
---|
| 1105 | # Convert IP address for socket functions
|
---|
| 1106 | pack( "C4", split /\./, "123.4.5.6" );
|
---|
| 1107 |
|
---|
| 1108 | # Count the bits in a chunk of memory (e.g. a select vector)
|
---|
| 1109 | unpack( '%32b*', $mask );
|
---|
| 1110 |
|
---|
| 1111 | # Determine the endianness of your system
|
---|
| 1112 | $is_little_endian = unpack( 'c', pack( 's', 1 ) );
|
---|
| 1113 | $is_big_endian = unpack( 'xc', pack( 's', 1 ) );
|
---|
| 1114 |
|
---|
| 1115 | # Determine the number of bits in a native integer
|
---|
| 1116 | $bits = unpack( '%32b*', pack( 'I!', ~0 ) );
|
---|
| 1117 |
|
---|
| 1118 | # Prepare argument for the nanosleep system call
|
---|
| 1119 | my $timespec = pack( 'L!L!', $secs, $nanosecs );
|
---|
| 1120 |
|
---|
| 1121 | For a simple memory dump we unpack some bytes into just as
|
---|
| 1122 | many pairs of hex digits, and use C<map> to handle the traditional
|
---|
| 1123 | spacing - 16 bytes to a line:
|
---|
| 1124 |
|
---|
| 1125 | my $i;
|
---|
| 1126 | print map( ++$i % 16 ? "$_ " : "$_\n",
|
---|
| 1127 | unpack( 'H2' x length( $mem ), $mem ) ),
|
---|
| 1128 | length( $mem ) % 16 ? "\n" : '';
|
---|
| 1129 |
|
---|
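A few of these recipes can be exercised together. The sketch below reuses the IP address from above and introduces illustrative variable names; it uses C<(H2)*> as a compact alternative for the hex dump of a short buffer.

```perl
use strict;
use warnings;

# pack a dotted quad and take it apart again
my $packed_ip = pack( 'C4', split /\./, '123.4.5.6' );
my $dotted    = join( '.', unpack( 'C4', $packed_ip ) );

# one pair of hex digits per byte
my $hex = join( ' ', unpack( '(H2)*', $packed_ip ) );

# exactly one of these is 1 on a machine with 2-byte shorts
my $is_little_endian = unpack( 'c',  pack( 's', 1 ) );
my $is_big_endian    = unpack( 'xc', pack( 's', 1 ) );

print "$dotted\n";    # prints "123.4.5.6"
print "$hex\n";       # prints "7b 04 05 06"
print $is_little_endian ? "little endian\n" : "big endian\n";
```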
| 1130 |
|
---|
| 1131 | =head1 Funnies Section
|
---|
| 1132 |
|
---|
| 1133 | # Pulling digits out of nowhere...
|
---|
| 1134 | print unpack( 'C', pack( 'x' ) ),
|
---|
| 1135 | unpack( '%B*', pack( 'A' ) ),
|
---|
| 1136 | unpack( 'H', pack( 'A' ) ),
|
---|
| 1137 | unpack( 'A', unpack( 'C', pack( 'A' ) ) ), "\n";
|
---|
| 1138 |
|
---|
| 1139 | # One for the road ;-)
|
---|
| 1140 | my $advice = pack( 'all u can in a van' );
|
---|
| 1141 |
|
---|
| 1142 |
|
---|
| 1143 | =head1 Authors
|
---|
| 1144 |
|
---|
| 1145 | Simon Cozens and Wolfgang Laun.
|
---|
| 1146 |
|
---|