
Endianness
From Wikipedia, the free encyclopedia

"Endian" redirects here. For the Linux routing and firewall distribution, see Endian Firewall. For Jonathan Swift's Big-Endian and Little-Endian parties in Lilliput, see Lilliput and Blefuscu.

In computing, the term endian or endianness refers to the ordering of individually addressable sub-components within the representation of a larger data item as stored in external memory (or, sometimes, as sent on a serial connection). Each sub-component in the representation has a unique degree of significance, like the place value of digits in a decimal number. These sub-components are typically 16-, 32- or 64-bit words, 8-bit bytes, or even bits.

Endianness is a difference in data representation at the hardware level and may or may not be transparent at higher levels, depending on factors such as the type of high-level language used. The most common cases concern how bytes are ordered within a single 16-, 32-, or 64-bit word, and endianness is then the same as byte order.[1] The usual contrast is whether the most significant or least significant byte is ordered first (i.e., at the lowest byte address) within the larger data item. A big-endian machine stores the most significant byte first, and a little-endian machine stores the least significant byte first. In these standard forms, the bytes remain ordered by significance. However, mixed forms are also possible, where the ordering of bytes within a 16-bit word may differ from the ordering of 16-bit words within a 32-bit word, for instance. Although rare, such cases do exist and may sometimes be referred to as mixed-endian or middle-endian.

Endianness is important as a low-level attribute of a particular data format. For example, the order in which the two bytes of a UCS-2 character are stored in memory is of considerable importance in network programming, where two computers with different byte orders may be communicating with each other. Failure to account for varying endianness across architectures when writing code for mixed platforms leads to failures and bugs that can be difficult to detect.

Endian   First byte (lowest address)   Middle bytes   Last byte (highest address)   Notes
big      most significant              ...            least significant             Similar to a number written on paper (in Arabic numerals as used in most Western scripts)
little   least significant             ...            most significant              Arithmetic calculation order (see carry propagation)

The term endian, as derived from end, may lead to confusion. It denotes which end (i.e., outermost part) of the number comes first, rather than which part comes at the end of the byte sequence representing the number. See Etymology for an explanation of why it refers to the end from which something starts, not to something coming last.
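To see byte order directly, here is a minimal C sketch (an editor's illustration, not part of the article) that stores a 32-bit value and prints its bytes in increasing address order:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint32_t value = 0x0A0B0C0D;
    unsigned char bytes[sizeof value];

    /* Copy the value into a byte array so its in-memory
       representation can be inspected without aliasing issues. */
    memcpy(bytes, &value, sizeof value);

    for (size_t i = 0; i < sizeof value; i++)
        printf("address offset %zu: 0x%02X\n", i, (unsigned)bytes[i]);
    /* Little-endian output: 0D 0C 0B 0A; big-endian: 0A 0B 0C 0D. */
    return 0;
}

On a little-endian machine the first line shows 0x0D; on a big-endian machine it shows 0x0A.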

Endianness and hardware

The full register width among different CPUs and other processor types varies widely (typically between 4 and 64 bits[citation needed]). The internal bit-, byte-, or word-ordering within such a register is normally not considered "endianness", despite the fact that some CPU instructions may address individual bits (or other parts) using various kinds of internal addressing schemes. Endianness only describes how the bits are organized as seen from the outside (i.e., when stored in memory). The fact that some assembly languages label bits in an unorthodox manner is also largely another matter (a few architectures/assemblers turn the conventional msb..lsb = D31..D0 the other way round, so that msb = D0).

Large integers are usually stored in memory as a sequence of smaller ones and obtained by simple concatenation. The simple forms are:
increasing numeric significance with increasing memory addresses (or increasing time), known as little-endian, and
decreasing numeric significance with increasing memory addresses (or increasing time), known as big-endian.[2]

Well-known processor architectures that use the little-endian format include x86 (including x86-64), 6502 (including 65802, 65C816), Z80 (including Z180, eZ80, etc.), MCS-48, 8051, DEC Alpha, Altera Nios, Atmel AVR, SuperH, VAX, and, largely, PDP-11. Well-known processors that use the big-endian format include Motorola 6800 and 68k, Xilinx Microblaze, IBM POWER, and System/360 and its successors such as System/370, ESA/390, and z/Architecture. The PDP-10 also used big-endian addressing for byte-oriented instructions. SPARC historically used big-endian until version 9, which is bi-endian; similarly, the ARM architecture was little-endian before version 3, when it became bi-endian; and the PowerPC and Power Architecture descendants of IBM POWER are also bi-endian (see below).

Serial protocols may also be regarded as either little- or big-endian at the bit and/or byte levels (which may differ). Many serial interfaces, such as the ubiquitous USB, are little-endian at the bit level. Physical standards like RS-232, RS-422 and RS-485 are also typically used with UARTs that send the least significant bit first, such as in industrial instrumentation applications, lighting protocols (DMX512), and so on. The same could be said for digital current loop signaling systems such as MIDI. There are also several serial formats where the most significant bit is normally sent first, such as I²C and the related SMBus. However, the bit order may often be reversed (or is "transparent") in the interface between the UART or communication controller and the host CPU or DMA controller (and/or system memory), especially in more complex systems and personal computers. These interfaces may be of any type and are often configurable.

Bi-endian hardware

Some architectures (including ARM versions 3 and above, PowerPC, Alpha, SPARC V9, MIPS, PA-RISC, SuperH SH-4 and IA-64) feature a setting which allows for switchable endianness in data segments, code segments or both. This feature can improve performance or simplify the logic of networking devices and software. The word bi-endian, when said of hardware, denotes the capability of the machine to compute or pass data in either endian format.
Many of these architectures can be switched via software to default to a specific endian format (usually done when the computer starts up); however, on some systems the default endianness is selected by hardware on the motherboard and cannot be changed via software (e.g., the Alpha, which runs only in big-endian mode on the Cray T3E).

Note that the term "bi-endian" refers primarily to how a processor treats data accesses. Instruction accesses (fetches of instruction words) on a given processor may still assume a fixed endianness, even if data accesses are fully bi-endian, though this is not always the case, such as on Intel's IA-64-based Itanium CPU, which allows both.

Note, too, that some nominally bi-endian CPUs require motherboard help to fully switch endianness. For instance, the 32-bit desktop-oriented PowerPC processors in little-endian mode act as little-endian from the point of view of the executing programs, but they require the motherboard to perform a 64-bit swap across all 8 byte lanes to ensure that the little-endian view of things will apply to I/O devices. In the absence of this unusual motherboard hardware, device driver software must write to different addresses to undo the incomplete transformation and also must perform a normal byte swap. Some CPUs, such as many PowerPC processors intended for embedded use, allow per-page choice of endianness.

Floating-point and endianness

Although the ubiquitous x86 processors of today use little-endian storage for all types of data (integer, floating point, BCD), there have been a few historical machines where floating-point numbers were represented in big-endian form while integers were represented in little-endian form.[3] Older ARM processors even used a half little-endian, half big-endian floating-point representation for double-precision numbers: each 32-bit word is stored little-endian, but the word containing the most significant bits comes first. Because there have been many floating-point formats, with no "network" standard representation for them, there is no formal standard for transferring floating-point values between heterogeneous systems. It may therefore appear strange that the widespread IEEE 754 floating-point standard does not specify endianness.[4] Theoretically, this means that even standard IEEE floating-point data written by one machine might not be readable by another. However, on modern standard computers (i.e., implementing IEEE 754), one may in practice safely assume that the endianness is the same for floating-point numbers as for integers, making the conversion straightforward regardless of data type. (Small embedded systems using special floating-point formats may be another matter, however.)

Endianness and operating systems on architectures

Little-endian operating system / architecture combinations:
DragonFly BSD on x86 and x86-64
FreeBSD on x86, x86-64, and Itanium
Linux on x86, x86-64, MIPSEL, Alpha, Itanium, S+core, MN103, CRIS, Blackfin, MicroblazeEL, ARM, M32REL, TILE, SH, XtensaEL and UniCore32
Mac OS X on x86, x86-64
iOS on ARM
NetBSD on x86, x86-64, Itanium, etc.
OpenBSD on x86, x86-64, Alpha, VAX, Loongson (MIPSEL)
OpenVMS on VAX, Alpha and Itanium
Solaris on x86, x86-64, PowerPC
Tru64 UNIX on Alpha
ESX on x86, x86-64
Windows on x86, x86-64, Alpha, PowerPC, MIPS and Itanium

Big-endian operating system / architecture combinations:
AIX on POWER
AmigaOS on PowerPC and 680x0
FreeBSD on PowerPC and SPARC
HP-UX on Itanium and PA-RISC
Linux on MIPS, SPARC, PA-RISC, POWER, PowerPC, 680x0, ESA/390, z/Architecture, H8, FR-V, AVR32, Microblaze, ARMEB, M32R, SHEB and Xtensa
Mac OS on PowerPC and 680x0
Mac OS X on PowerPC
NetBSD on PowerPC, SPARC, etc.
OpenBSD on PowerPC, SPARC, PA-RISC, SGI (MIPSEB), Motorola 68k and 88k, Landisk (SuperH-4)
MVS and DOS/VSE on ESA/390, and z/VSE and z/OS on z/Architecture
Solaris on SPARC

Etymology

An egg in an egg cup with the little end oriented upward.

The term big-endian originally comes from Jonathan Swift's satirical novel Gulliver's Travels, by way of Danny Cohen in 1980.[5] In 1726, Swift described tensions in Lilliput and Blefuscu: whereas royal edict in Lilliput requires cracking open one's soft-boiled egg at the small end, inhabitants of the rival kingdom of Blefuscu crack theirs at the big end (giving them the moniker Big-Endians).[6] The terms little-endian and endianness have a similar intent.[7]

"On Holy Wars and a Plea for Peace"[5] by Danny Cohen ends with: "Swift's point is that the difference between breaking the egg at the little-end and breaking it at the big-end is trivial. Therefore, he suggests, that everyone does it in his own preferred way. We agree that the difference between sending eggs with the little- or the big-end first is trivial, but we insist that everyone must do it in the same way, to avoid anarchy. Since the difference is trivial we may choose either way, but a decision must be made."
History

The problem of dealing with data in different representations is sometimes termed the NUXI problem.[8] This terminology alludes to the issue that a value represented by the byte string "UNIX" on a big-endian system may be stored as "NUXI" on a PDP-11 middle-endian system; UNIX was one of the first systems to allow the same code to run on, and transfer data between, platforms with different internal representations.

An often-cited argument in favor of big-endian is that it is consistent with the ordering commonly used in natural languages.[9] Spoken languages have a wide variety of organizations of numbers: the decimal number 92 is spoken in English as ninety-two, in German and Dutch as two and ninety, and in French as four-twenty-twelve, with a similar system in Danish (two-and-half-five-times-twenty). However, numbers are written almost universally in the Hindu-Arabic numeral system, in which the most significant digits are written first in languages written left-to-right, and last in languages written right-to-left.[10]

Optimization

The little-endian system has the property that the same value can be read from memory at different lengths without using different addresses (even when alignment restrictions are imposed). For example, a 32-bit memory location with content 4A 00 00 00 can be read at the same address as either 8-bit (value = 4A), 16-bit (004A), 24-bit (00004A), or 32-bit (0000004A), all of which retain the same numeric value. Although this little-endian property is rarely used directly by high-level programmers, it is often employed by code optimizers as well as by assembly language programmers. On the other hand, in some situations it may be useful to obtain an approximation of a multi-byte or multi-word value by reading only its most significant portion instead of the complete representation; a big-endian processor may read such an approximation using the same base address that would be used for the full value.
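The multi-length read just described can be demonstrated with a hedged C sketch (my own illustration, not from the article; memcpy is used so the narrow reads stay well-defined):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    /* A 32-bit slot containing 4A 00 00 00 on a little-endian machine. */
    uint32_t slot = 0x0000004A;
    uint8_t  v8;
    uint16_t v16;
    uint32_t v32;

    memcpy(&v8,  &slot, sizeof v8);   /* read 8 bits at the base address  */
    memcpy(&v16, &slot, sizeof v16);  /* read 16 bits at the same address */
    memcpy(&v32, &slot, sizeof v32);  /* read 32 bits at the same address */

    /* On a little-endian machine all three reads yield 0x4A. */
    printf("8-bit: 0x%02X  16-bit: 0x%04X  32-bit: 0x%08X\n",
           (unsigned)v8, (unsigned)v16, (unsigned)v32);
    return 0;
}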

Calculation order

Little-endian representation simplifies hardware in processors that add multi-byte integral values a byte at a time, such as small-scale byte-addressable processors and microcontrollers. As carry propagation must start at the least significant bit (and thus byte), multi-byte addition can then be carried out with a monotonically incrementing address sequence, a simple operation already present in hardware. On a big-endian processor, the addressing unit has to be told how big the addition is going to be so that it can hop forward to the least significant byte, then count back down towards the most significant. However, high-performance processors usually fetch multi-byte operands from memory in a single operation, so that the complexity of the hardware is not affected by the byte ordering.
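To make the carry-propagation argument concrete, here is a small sketch, assuming software byte arrays rather than real adder hardware, of multi-byte addition over a little-endian layout with a single monotonically increasing index:

#include <stdint.h>
#include <stddef.h>

/* Add two n-byte little-endian integers a and b, storing the result
   in out. Addresses increase from least to most significant byte, so
   the carry can ride along a simple incrementing index. */
void add_le(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)a[i] + b[i] + carry;
        out[i] = (uint8_t)(sum & 0xFF);
        carry  = sum >> 8;
    }
    /* A big-endian layout would force the loop to start at i = n - 1
       and walk downwards instead. */
}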

Diagram for mapping registers to memory locations

Mapping registers to memory locations

Using this chart, one can map an access (for a concrete example: "write a 32-bit value to address 0") from register to memory or from memory to register. To help in understanding that access, little- and big-endianness can be seen in the diagram as differing in their coordinate system's orientation. Big-endianness's atomic units (in this example the atomic unit is the byte) and memory coordinate system increase in the diagram from left to right, while little-endianness's units increase from right to left.

A simple reminder is: in little-endian, the least significant byte goes into the lowest-addressed slot. So in the above example, D, the least significant byte, goes into slot 0.

If you write the hex value 0x0a0b0c0d in a Western language, you write the bytes from left to right, which is implicitly big-endian style: 0x0a at 0, 0x0b at 1, 0x0c at 2, 0x0d at 3. Memory, on the other hand, is normally also printed out bytewise from left to right: first memory address 0, then address 1, then address 2, then address 3. So on a big-endian system, when you write a 32-bit value from a register to an address in memory and then print that memory, you "see what you have written", because the same left-to-right coordinate system is used for the output of register values and the output of memory.

On a little-endian system, however, the logical address 0 of a value in a register (for 8-bit, 16-bit and 32-bit) is the least significant byte, the one on the right: 0x0d at 0, 0x0c at 1, 0x0b at 2, 0x0a at 3. If you write a 32-bit register value to a memory location on a little-endian system and then print that memory location (with addresses growing from left to right), the output of the memory will appear reversed (byte-swapped). There are two ways to synchronize what you see as register values with what you see as memory: swap the output of the register values (0x0a0b0c0d => 0x0d0c0b0a), or swap the output of the memory (print from right to left). Because the values of registers are interpreted as numbers, which are written from left to right in Western languages, it is natural to use the second approach and display the memory from right to left. The above diagram does exactly that: when visualizing memory ("thinking memory") on a little-endian system, the memory should be seen as growing to the left.

Examples of storing the value 0A0B0C0Dh in memory

Note that hexadecimal notation is used. To illustrate these notions, this section shows example layouts of the 32-bit number 0A0B0C0Dh in the two most common variants of endianness. There exist several digital processors that use other formats, but these two are the most common in general processors. That is true for typical embedded systems as well as for general computer CPUs. Most processors used in non-CPU roles in typical computers (in storage units, peripherals, etc.) also use one of these two basic formats, although not always at 32-bit width, of course. All the examples refer to the storage of the value in memory.

Big-endian

Atomic element size 8-bit, address increment 1-byte (octet)

increasing addresses (left to right):

... | 0Ah | 0Bh | 0Ch | 0Dh | ...

The most significant byte (MSB) value, which is 0Ah in our example, is stored at the memory location with the lowest address; the next byte value in significance, 0Bh, is stored at the following memory location, and so on. This is akin to left-to-right reading in hexadecimal order.

Atomic element size 16-bit

increasing addresses (left to right):

... | 0A0Bh | 0C0Dh | ...

The most significant atomic element now stores the value 0A0Bh, followed by 0C0Dh.

Little-endian

Atomic element size 8-bit, address increment 1-byte (octet)

increasing addresses (left to right):

... | 0Dh | 0Ch | 0Bh | 0Ah | ...

The least significant byte (LSB) value, 0Dh, is at the lowest address. The other bytes follow in increasing order of significance.

Atomic element size 16-bit

increasing addresses (left to right):

... | 0C0Dh | 0A0Bh | ...

The least significant 16-bit unit stores the value 0C0Dh, immediately followed by 0A0Bh. Note that 0C0Dh and 0A0Bh represent integers, not bit layouts (see bit numbering).

Byte addresses increasing from right to left

Visualizing memory addresses from left to right makes little-endian values appear backwards. If the addresses are written increasing towards the left instead, each individual little-endian value appears forwards. However, strings of values or characters then appear reversed instead.

With 8-bit atomic elements, increasing addresses (right to left):

... | 0Ah | 0Bh | 0Ch | 0Dh | ...

The least significant byte (LSB) value, 0Dh, is at the lowest address. The other bytes follow in increasing order of significance.

With 16-bit atomic elements, increasing addresses (right to left):

... | 0A0Bh | 0C0Dh | ...

The least significant 16-bit unit stores the value 0C0Dh, immediately followed by 0A0Bh.

The display of text is reversed from the normal display of languages such as English that read from left to right. For example, the word "XRAY" displayed in this manner, with each character stored in an 8-bit atomic element, increasing addresses (right to left):

... | "Y" | "A" | "R" | "X" | ...

If pairs of characters are stored in 16-bit atomic elements (using 8 bits per character), it can look even stranger, increasing addresses (right to left):

... | "AY" | "XR" | ...

This conflict between the memory arrangements of binary data and text is intrinsic to the nature of the little-endian convention, but it is a conflict only for languages written left-to-right, such as Indo-European languages including English. For right-to-left languages such as Arabic and Hebrew, there is no conflict of text with binary, and the preferred display in both cases would be with addresses increasing to the left. (On the other hand, right-to-left languages have a complementary intrinsic conflict in the big-endian system.)

Middle-endian

Numerous other orderings, generically called middle-endian or mixed-endian, are possible. On the PDP-11 (a 16-bit little-endian machine), for example, the compiler stored 32-bit values with the 16-bit halves swapped from the expected little-endian order. This ordering is known as PDP-endian.

Storage of a 32-bit word on a PDP-11, increasing addresses (left to right):

... | 0Bh | 0Ah | 0Dh | 0Ch | ...

The ARM architecture can also produce this format when writing a 32-bit word to an address 2 bytes away from 32-bit word alignment.

Segment descriptors on Intel 80386 and compatible processors keep the 32-bit base address of the segment stored in little-endian order, but in four non-consecutive bytes, at relative positions 2, 3, 4 and 7 from the descriptor start.
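As an aside, converting a 32-bit word between PDP-endian and conventional little-endian order is just a swap of the 16-bit halves. A minimal sketch (an editor's illustration, not from the article):

#include <stdint.h>

/* Convert a 32-bit word between PDP-endian and little-endian by
   swapping its 16-bit halves. The transformation is its own inverse. */
uint32_t pdp_to_le32(uint32_t x)
{
    return (x << 16) | (x >> 16);
}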

Endianness in networking

Many IETF RFCs use the term network order, which simply describes the order of transmission for bits and bytes over the wire in network protocols. Among others, the historic RFC 1700 (also known as Internet standard STD 2) defines this as big-endian order.[11] The telephone network, historically and presently, sends the most significant part first, the area code; doing so allows routing while a telephone number is still being composed.

The Internet Protocol defines big-endian as the standard network byte order used for all numeric values in the packet headers and by many higher-level protocols and file formats that are designed for use over IP. The Berkeley sockets API defines a set of functions to convert 16-bit and 32-bit integers to and from network byte order: the htonl (host-to-network long) and htons (host-to-network short) functions convert 32-bit and 16-bit values respectively from machine (host) to network order; the ntohl and ntohs functions convert from network to host order. These functions may be a no-op on a big-endian system. In CANopen, multi-byte parameters are always sent least significant byte first (little-endian). While the lowest network protocols may deal with sub-byte formatting, all the layers above them usually consider the byte (mostly meant as octet) their atomic unit.

Endianness in files and byte swap

Endianness is a problem when a binary file created on one computer is read on another computer with different endianness. Some compilers have built-in facilities to deal with data written in other formats. For example, the Intel Fortran compiler supports the non-standard CONVERT specifier, so a file can be opened as

OPEN(unit, CONVERT='BIG_ENDIAN', ...) or
OPEN(unit, CONVERT='LITTLE_ENDIAN', ...)

Some compilers have options to generate code that globally enables the conversion for all file I/O operations. This allows one to reuse code on a system of the opposite endianness without modifying the code itself. If the compiler does not support such conversion, the programmer needs to swap the bytes via ad hoc code.

Fortran sequential unformatted files created with one endianness usually cannot be read on a system of the other endianness, because Fortran usually implements a record (defined as the data written by a single Fortran statement) as the data preceded and followed by count fields, which are integers equal to the number of bytes in the data. An attempt to read such a file on a system of the other endianness results in a run-time error, because the count fields are incorrect. This problem can be avoided by writing out sequential binary files as opposed to sequential unformatted ones.

Unicode text can optionally start with a byte order mark (BOM) to signal the endianness of the file or stream. Its code point is U+FEFF. In UTF-32, for example, a big-endian file should start with 00 00 FE FF; in a little-endian file these bytes are reversed.

Application binary data formats, such as MATLAB .mat files or the .BIL data format used in topography, are usually endianness-independent. This is achieved either by always storing the data in one fixed endianness, or by carrying with the data a switch indicating the endianness it was written with. When reading the file, the application converts the endianness, transparently to the user. An example is TIFF image files, whose headers indicate the endianness of their internal binary integers.
If a file starts with the signature "MM", integers are represented as big-endian, while "II" means little-endian. These signatures need a single 16-bit word each, and they are palindromes (they read the same forwards and backwards), so they are endianness-independent. "I" stands for Intel and "M" stands for Motorola, the respective CPU providers of the IBM PC compatible and Apple Macintosh platforms in the 1980s: Intel CPUs are little-endian, while Motorola 680x0 CPUs are big-endian. This explicit signature allows a TIFF reader program to swap bytes if necessary when a given file was generated by a TIFF writer running on a computer with a different endianness.

The LabVIEW programming environment, though most commonly installed on Windows machines, was first developed on a Macintosh, and uses big-endian format for its binary numbers, while most Windows programs use little-endian format.[12]

Note that since the required byte swap depends on the length of the variables stored in the file (two 2-byte integers require a different swap than one 4-byte integer), a general utility to convert endianness in binary files cannot exist.[citation needed]

"Bit endianness"

Main article: bit numbering

The terms bit endianness or bit-level endianness are seldom used when talking about the representation of a stored value, as they are only meaningful for the rare computer architectures where each individual bit has a unique address. They are used, however, to refer to the transmission order of bits over a serial medium. Most often that order is transparently managed by the hardware and is the bit-level analogue of little-endian (low bit first), although protocols exist which require the opposite ordering (e.g., I²C, and SONET and SDH[13]). In networking, the decision about the order of transmission of bits is made at the very bottom of the data link layer of the OSI model. As bit ordering is usually only relevant at a very low level, terming transmissions "LSB first" or "MSB first" is more descriptive than assigning an endianness to bit ordering.

Other meanings

Some authors extend the usage of the word "endianness", and of related terms, to entities such as street addresses, date formats and others. Such usages, which basically reduce endianness to a mere synonym for the ordering of parts, are non-standard[citation needed] (e.g., ISO 8601:2004 talks about "descending order year-month-day", not about "big-endian format"), do not have widespread use, and are generally (other than for date formats) employed in a metaphorical sense. "Endianness" is sometimes used to describe the order of the components of a domain name, e.g. 'en.wikipedia.org' (the usual modern 'little-endian' form) versus the reverse-DNS 'org.wikipedia.en' ('big-endian', used for naming components, packages, or types in computer systems, for example Java packages, Macintosh ".plist" files, etc.). URLs can be considered 'big-endian', even though the host part could be a 'little-endian' DNS name.

Endianness

Endianness refers to the ordering of bytes in a multi-byte number. Big-endian refers to an architecture where the most significant byte has the lowest address, while in the opposite, little-endian, the most significant byte has the highest address. You can find the endianness of your architecture using the following programming snippets:

int x = 1;
if (*(char *)&x == 1)
    printf("little-endian\n");
else
    printf("big-endian\n");

#define LITTLE_ENDIAN 0
#define BIG_ENDIAN 1

int machineEndianness()
{
    short s = 0x0102;
    char *p = (char *) &s;
    if (p[0] == 0x02)  // lowest address contains the least significant byte
        return LITTLE_ENDIAN;
    else
        return BIG_ENDIAN;
}

Here is some code to determine the type of your machine:

int num = 1;
if (*(char *)&num == 1) {
    printf("\nLittle-Endian\n");
} else {
    printf("Big-Endian\n");
}

I'm surprised no one has mentioned the macros which the preprocessor defines by default. While these will vary depending on your platform, they are much cleaner than having to write your own endian check. For example, if we look at the built-in macros which GCC defines (on an x86-64 machine):

:| gcc -dM -E -x c - | grep -i endian
#define __LITTLE_ENDIAN__ 1

On a PPC machine I get:

:| gcc -dM -E -x c - | grep -i endian
#define __BIG_ENDIAN__ 1
#define _BIG_ENDIAN 1

(The ":| gcc -dM -E -x c -" magic prints out all built-in macros.) For further details, you may want to check out the CodeProject article "Basic concepts on Endianness".

I would do something like this:

bool isBigEndian()
{
    static unsigned long x(1);
    static bool result(reinterpret_cast<unsigned char*>(&x)[0] == 0);
    return result;
}

Along these lines, you would get a time-efficient function that only does the calculation once.

int i = 1;
char *c = (char*)&i;
bool littleendian = *c;  // note: the original post tested the pointer itself, which is always true; test the byte instead

How about this?

#include <cstdio>

int main()
{
    unsigned int n = 1;
    char *p = 0;

    p = (char*)&n;
    if (*p == 1)
        std::printf("Little Endian\n");
    else if (*(p + sizeof(int) - 1) == 1)
        std::printf("Big Endian\n");
    else
        std::printf("What the crap?\n");
    return 0;
}

See Endianness - C-Level Code illustration.

// assuming target architecture is 32-bit = 4 bytes
enum ENDIANESS { LITTLEENDIAN, BIGENDIAN, UNHANDLE };

ENDIANESS CheckArchEndianalityV1(void)
{
    int Endian = 0x00000001;  // assuming target architecture is 32-bit

    // As Endian = 0x00000001, the MSB (Most Significant Byte) is 0x00
    // and the LSB (Least Significant Byte) is 0x01.
    // Casting down to a single byte keeps the byte at the lowest
    // address and discards the higher bytes.
    return (*(char *) &Endian == 0x01) ? LITTLEENDIAN : BIGENDIAN;
}

What operations are you planning on doing where endianness makes a difference? All standard arithmetic, array manipulation, etc. operations will be portable across the different architectures, as the differences will be taken care of for you by each platform's compiler.
Saving and loading files across differing-endian platforms comes to mind. fbrereto Feb 23 '10 at 1:01

You can also do this via the preprocessor, using something like the Boost.Endian header.

Here's another C version. It defines a macro called wicked_cast() for inline type punning via C99 union literals and the non-standard __typeof__ operator.

#include <limits.h>

#if UCHAR_MAX == UINT_MAX
#error endianness irrelevant as sizeof(int) == 1
#endif

#define wicked_cast(TYPE, VALUE) \
    (((union { __typeof__(VALUE) src; TYPE dest; }){ .src = VALUE }).dest)

_Bool is_little_endian(void)
{
    return wicked_cast(unsigned char, 1u);
}

If integers are single-byte values, endianness makes no sense and a compile-time error will be generated.

Introduction

A long time ago, on a very remote island known as Lilliput, society was split into two factions: Big-Endians, who opened their soft-boiled eggs at the larger end ("the primitive way"), and Little-Endians, who broke their eggs at the smaller end. As the Emperor commanded all his subjects to break the smaller end, this resulted in a civil war with dramatic consequences: 11,000 people have, at several times, suffered death rather than submit to breaking their eggs at the smaller end. Eventually, the 'Little-Endian' vs. 'Big-Endian' feud carried over into the world of computing as well, where it refers to the order in which the bytes of multi-byte numbers are stored: most significant first (Big-Endian) or least significant first (Little-Endian).[1]

To be more precise: Big-Endian means that the most significant byte of any multi-byte data field is stored at the lowest memory address, which is also the address of the larger field. Little-Endian means that the least significant byte of any multi-byte data field is stored at the lowest memory address, which is also the address of the larger field.

For example, consider the 32-bit number 0xDEADBEEF. Following the Big-Endian convention, a computer will store it as follows:

Figure 1. Big-Endian: The most significant byte is stored at the lowest byte address. Whereas architectures that follow the Little-Endian rules will store it as depicted in Figure 2:

Figure 2. Little-Endian: The least significant byte is stored at the lowest byte address.

The Intel x86 family and Digital Equipment Corporation architectures (PDP-11, VAX, Alpha) are representatives of Little-Endian, while the Sun SPARC, IBM 360/370, and Motorola 68000 and 88000 architectures are Big-Endian. Still other architectures, such as PowerPC, MIPS, and Intel's IA-64, are Bi-Endian, i.e. capable of operating in either Big-Endian or Little-Endian mode.[1]

Endianness is also referred to as the NUXI problem. Imagine the word UNIX stored in two 2-byte words. In a Big-Endian system, it would be stored as UNIX. In a Little-Endian system, it would be stored as NUXI.

Which format is better? Like the egg debate described in Gulliver's Travels, the Big- vs. Little-Endian computer dispute has much more to do with political issues than with technological merits. In practice, both systems perform equally well in most applications. There is, however, a significant difference in performance when using Little-Endian processors instead of Big-Endian ones in network devices (more details below).

How to switch from one format to the other? It is very easy to reverse a multi-byte number if you need the other format; it is simply a matter of swapping bytes, and the conversion is the same in both directions. The following example shows how an Endian conversion function could be implemented using simple C unions:

unsigned long ByteSwap1(unsigned long nLongNumber)
{
    union u { unsigned long vi; unsigned char c[sizeof(unsigned long)]; };
    union v { unsigned long ni; unsigned char d[sizeof(unsigned long)]; };
    union u un;
    union v vn;
    un.vi = nLongNumber;
    vn.d[0] = un.c[3];
    vn.d[1] = un.c[2];
    vn.d[2] = un.c[1];
    vn.d[3] = un.c[0];
    return vn.ni;
}

Note that this function is intended to work with 32-bit integers. A more efficient function can be implemented using bitwise operations, as shown below:

unsigned long ByteSwap2(unsigned long nLongNumber)
{
    return (((nLongNumber & 0x000000FF) << 24) + ((nLongNumber & 0x0000FF00) << 8) +
            ((nLongNumber & 0x00FF0000) >> 8) + ((nLongNumber & 0xFF000000) >> 24));
}

And this is a version in assembly language:

unsigned long ByteSwap3(unsigned long nLongNumber)
{
    unsigned long nRetNumber;

    __asm {
        mov eax, nLongNumber
        xchg ah, al
        ror eax, 16
        xchg ah, al
        mov nRetNumber, eax
    }

    return nRetNumber;
}

A 16-bit version of a byte swap function is really straightforward:

unsigned short ByteSwap4(unsigned short nValue)
{
    return ((nValue >> 8) | (nValue << 8));
}

Finally, we can write a more general function that can deal with any atomic data type (e.g. int, float, double, etc.) with automatic size detection:

#include <algorithm>  // required for std::swap

#define ByteSwap5(x) ByteSwap((unsigned char *) &x, sizeof(x))

void ByteSwap(unsigned char *b, int n)
{
    register int i = 0;
    register int j = n - 1;
    while (i < j) {
        std::swap(b[i], b[j]);
        i++, j--;
    }
}

For example, the next code snippet shows how to convert a data array of doubles from one format (e.g. Big-Endian) to the other (e.g. Little-Endian):

double* dArray;  // array in big-endian format
int n;           // number of elements
for (register int i = 0; i < n; i++)
    ByteSwap5(dArray[i]);

Actually, in most cases you won't need to implement any of the above functions, since there is a set of socket functions (see Table I), declared in winsock2.h, which are defined for TCP/IP, so all machines that support TCP/IP networking have them available. They store the data in 'network byte order', which is standard and endianness-independent.

Function   Purpose
ntohs      Convert a 16-bit quantity from network byte order to host byte order (Big-Endian to Little-Endian).
ntohl      Convert a 32-bit quantity from network byte order to host byte order (Big-Endian to Little-Endian).
htons      Convert a 16-bit quantity from host byte order to network byte order (Little-Endian to Big-Endian).
htonl      Convert a 32-bit quantity from host byte order to network byte order (Little-Endian to Big-Endian).

Table I: Windows Sockets Byte-Order Conversion Functions [2]

The socket interface specifies a standard byte ordering called network byte order, which happens to be Big-Endian. Consequently, all network communication should be Big-Endian, irrespective of the client or server architecture.

Suppose your machine uses Little-Endian order. To transmit the 32-bit value 0x0a0b0c0d over a TCP/IP connection, you have to call htonl() and transmit the result:

TransmitNum(htonl(0x0a0b0c0d));

Likewise, to convert an incoming 32-bit value, use ntohl():

int n = ntohl(GetNumberFromNetwork());

If the processor on which the TCP/IP stack is to be run is itself also Big-Endian, each of the four macros (i.e. ntohs, ntohl, htons, htonl) will be defined to do nothing, and there will be no run-time performance impact. If, however, the processor is Little-Endian, the macros will reorder the bytes appropriately. These macros are routinely called when building and parsing network packets and when socket connections are created. Serious run-time performance penalties occur when using TCP/IP on a Little-Endian processor. For that reason, it may be unwise to select a Little-Endian processor for use in a device, such as a router or gateway, with an abundance of network functionality. (Excerpt from reference [1].)

One additional problem with the host-to-network APIs is that they are unable to manipulate 64-bit data elements. However, you can write your own corresponding functions:

ntohll: converts a 64-bit integer to host byte order.
htonll: converts a 64-bit integer to network byte order.

The implementation is simple enough:

#define ntohll(x) (((_int64)(ntohl((int)((x << 32) >> 32))) << 32) | (unsigned int)ntohl(((int)(x >> 32))))  // By Runner
#define htonll(x) ntohll(x)

How to dynamically test for the Endian type at run time?

As explained in the Computer Animation FAQ, you can use the following function to see if your code is running on a Little- or Big-Endian system:

#define BIG_ENDIAN 0
#define LITTLE_ENDIAN 1


int TestByteOrder()
{
    short int word = 0x0001;
    char *byte = (char *) &word;
    return (byte[0] ? LITTLE_ENDIAN : BIG_ENDIAN);
}

This code assigns the value 0001h to a 16-bit integer. A char pointer is then assigned to point at the first (least significant) byte of the integer value. If the first byte of the integer is 0x01, then the system is Little-Endian (the 0x01 is in the lowest, or least significant, address). If it is 0x00, the system is Big-Endian. Similarly,

bool IsBigEndian()
{
    short word = 0x4321;
    if ((*(char *)&word) != 0x21)
        return true;
    else
        return false;
}

which is just the reverse side of the same coin. You can also use the standard byte-order APIs to determine the byte order of a system at run time. For example:

bool IsBigEndian()
{
    return (htonl(1) == 1);
}

Auto-detecting the correct Endian format of a data file

Suppose you are developing a Windows application that imports Nuclear Magnetic Resonance (NMR) spectra. High-resolution NMR files are generally recorded on Silicon Graphics or Sun workstations, but recently Windows- or Linux-based spectrometers are emerging as practical substitutes. It turns out that you will need to know in advance the Endian format of the file to parse all the information correctly. Here are some practical guidelines you can follow to decipher the correct Endianness of a data file:

1. Typically, the binary file includes a header with information about the Endian format (see the sketch after this section).
2. If the header is not present, you can guess the Endian format if you know the native format of the computer the file comes from. For instance, if the file was created on a Sun workstation, the Endian format will most likely be Big-Endian.
3. If none of the above points apply, the Endian format can be determined by trial and error. For example, if after reading the file assuming one format the spectrum does not make sense, you know that you have to use the other format.

If the data points in the file are in floating-point format (double), then the _isnan() function can be of some help in determining the Endian format. For example:

double dValue;
FILE* fp;
(...)
fread(&dValue, sizeof(double), 1, fp);
bool bByteSwap = _isnan(dValue) ? true : false;

Note that this method only guarantees that the byte-swap operation is required if _isnan() returns a nonzero value (TRUE); if it returns 0 (FALSE), it is not possible to know the correct Endian format without further information.
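Expanding on guideline 1, here is a hedged sketch of header-based detection; the magic number and function name are hypothetical, not taken from any real NMR format:

#include <stdio.h>
#include <stdint.h>

#define FILE_MAGIC 0x0A0B0C0Du  /* hypothetical format magic number */

/* Read the file's leading magic number and report whether the
   producer's byte order matches ours (0), is reversed (1), or the
   file is unrecognized (-1). */
int detect_file_endianness(FILE *fp)
{
    uint32_t magic;
    if (fread(&magic, sizeof magic, 1, fp) != 1)
        return -1;
    if (magic == FILE_MAGIC)
        return 0;  /* same byte order as this machine */

    /* Byte-reversed magic: the writer used the opposite order. */
    uint32_t swapped = ((magic & 0x000000FFu) << 24) |
                       ((magic & 0x0000FF00u) <<  8) |
                       ((magic & 0x00FF0000u) >>  8) |
                       ((magic & 0xFF000000u) >> 24);
    return (swapped == FILE_MAGIC) ? 1 : -1;
}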

Compiler optimizations.... [modified]

Coriiander

15:24 3 May '11

As none of the posted code was working as expected, I had to write my own, which I am sharing with all of you:

#define ntohll(x) ((ntohl((int)1) == (int)1) ? x : (((_int64)(ntohl((int)(x & 0xffffffff))) << 32) | (_int64)ntohl((int)(x >> 32))))
#define htonll(x) ntohll(x)

Question

Prafulla Tekawade

Assume that an integer pointer x is declared in a C program as

int *x;

Further assume that the location of the pointer is 1000 and it points to address 2000, where the value 500 is stored in 4 bytes. What is the output of

printf("%d", *x);

a) 500
b) 1000
c) undefined
d) 2000

7:06 11 Jan '07

What will be the answer? I think it should be (c), because on a little-endian machine it will print 500 and on a big-endian machine it will print 0. I am not quite sure. Assume x retrieves two bytes of memory. Please help.

Detecting big/little endian at compile time?

Patrick Hoffmann

Does anybody know if there is a possibility to detect big/little endian within the preprocessor?

11:58 8 Sep '06


Best Regards,
Patrick Hoffmann
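(A possible answer, sketched by the editor rather than taken from the thread: modern GCC and Clang predefine __BYTE_ORDER__ and its companion macros, which can be tested in the preprocessor; older or non-GNU compilers may not define them, so a runtime fallback is still prudent.)

#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
#  if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#    define MY_LITTLE_ENDIAN 1
#  elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#    define MY_BIG_ENDIAN 1
#  endif
#else
#  error "Byte order unknown at compile time; fall back to a runtime check."
#endif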

errors in macro

Anonymous

#define ntohll(x) (((_int64)(ntohl((int)((x << 32) >> 32))) << 32) | (unsigned int)ntohl(((int)(x >> 32))))  // By Runner

Don't forget the backslash to splice the preprocessor lines together:

#define ntohll(x) (((_int64)(ntohl((int)((x << 32) >> 32))) << 32) | \
                   (unsigned int)ntohl(((int)(x >> 32))))

Whenever an argument is used in the body of a macro, write parentheses around it:

#define ntohll(x) (((_int64)(ntohl((int)(((x) << 32) >> 32))) << 32) | \
                   (unsigned int)ntohl(((int)((x) >> 32))))

12:10 3 Oct '05

In a sensible world, ntohll ought to deal with and yield unsigned integral types. I have never programmed on Windows, so I don't know whether it is sensible in this way:

#define ntohll(x) (((_uint64)(ntohl((unsigned int)(((x) << 32) >> 32))) << 32) | \
                   (unsigned int)ntohl(((unsigned int)((x) >> 32))))

For portability, the right types to use are those provided by <stdint.h>:

#define ntohll(x) (((uint64_t)(ntohl((uint32_t)(((x) << 32) >> 32))) << 32) | \
                   (uint32_t)ntohl(((uint32_t)((x) >> 32))))

It is not really safe to trust the type of an argument. Cast it:

#define ntohll(x) (((uint64_t)(ntohl((uint32_t)(((uint64_t)(x) << 32) >> 32))) << 32) | \
                   (uint32_t)ntohl(((uint32_t)((uint64_t)(x) >> 32))))

The first chunk shifts the low-order half up, then down, then up. Makes me dizzy:

#define ntohll(x) (((uint64_t)(ntohl((uint32_t)(((uint64_t)(x))))) << 32) | \
                   (uint32_t)ntohl(((uint32_t)((uint64_t)(x) >> 32))))

I'm confused by the parens (the digit rows counted matching parenthesis pairs in the original post):

123 33 45 5567 77 76543 2
#define ntohll(x) (((uint64_t)(ntohl((uint32_t)(((uint64_t)(x))))) << 32) | \
2 2 234 445 55 5 4321
                   (uint32_t)ntohl(((uint32_t)((uint64_t)(x) >> 32))))

So:

#define ntohll(x) ((uint64_t)ntohl((uint32_t)(uint64_t)(x)) << 32 | \
                   (uint32_t)ntohl((uint32_t)((uint64_t)(x) >> 32)))

A couple of casts are pointless:

#define ntohll(x) ((uint64_t)ntohl((uint32_t)(x)) << 32 | \
                   ntohl((uint32_t)((uint64_t)(x) >> 32)))

There might be more problems, even in my transformations, but it looks better to me.

What about left-to-right or right-to-left?

rbid, 21:16 8 Jan '05

Hello, great article. From my experience, some people confuse two terms: endianness vs. memory view. Memory can be described as "left-to-right" or "right-to-left". If we have the value 0x01020304 stored as a 32-bit value, then according to endianness and memory view we can get:

Big-Endian, left-to-right:     01 02 03 04   at addresses A, A+1, A+2, A+3, ...
Little-Endian, right-to-left:  01 02 03 04   at addresses A+3, A+2, A+1, A, ...
Little-Endian, left-to-right:  04 03 02 01   at addresses A, A+1, A+2, A+3, ...

'A' represents a byte address in memory.


In other words, when you describe a 32-bit register, some people put the MSB on the right and the LSB on the left, or vice versa, while describing a number with the same byte ordering (endianness).
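A small C sketch (the editor's illustration, not from the post) that prints the same 32-bit value's memory in both display orders, making the byte-order vs. memory-view distinction explicit:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint32_t v = 0x01020304;
    unsigned char b[sizeof v];
    memcpy(b, &v, sizeof v);

    /* Left-to-right view: addresses increase to the right. */
    printf("left-to-right : ");
    for (size_t i = 0; i < sizeof v; i++)
        printf("%02X ", (unsigned)b[i]);
    printf("\n");

    /* Right-to-left view: addresses increase to the left, so a
       little-endian value reads "forwards" again. */
    printf("right-to-left : ");
    for (size_t i = sizeof v; i-- > 0; )
        printf("%02X ", (unsigned)b[i]);
    printf("\n");
    return 0;
}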



Understanding Big and Little Endian Byte Order

Problems with byte order are frustrating, and I want to spare you the grief I experienced. Here's the key:

Problem: Computers speak different languages, like people. Some write data "left-to-right" and others "right-to-left". A machine can read its own data just fine; problems happen when one computer stores data and a different type tries to read it.

Solutions:
Agree to a common format (i.e., all network traffic follows a single format), or
Always include a header that describes the format of the data. If the header appears backwards, it means the data was stored in the other format and needs to be converted.

Numbers vs. Data

The most important concept is to recognize the difference between a number and the data that represents it. A number is an abstract concept, such as a count of something. You have ten fingers. The idea of "ten" doesn't change, no matter what representation you use: ten, 10, diez (Spanish), ju (Japanese), 1010 (binary), X (Roman numeral)... these representations all point to the same concept of "ten".

Contrast this with data. Data is a physical concept, a raw sequence of bits and bytes stored on a computer. Data has no inherent meaning and must be interpreted by whoever is reading it. Data is like human writing, which is simply marks on paper. There is no inherent meaning in these marks. If we see a line and a circle (like this: |O) we may interpret it to mean "ten". But we assumed the marks referred to a number. They could have been the letters "IO", a moon of Jupiter. Or perhaps the Greek goddess. Or maybe an abbreviation for Input/Output. Or someone's initials. Or 10 in base 2, aka 2 in binary. The list of possibilities goes on. The point is that a single piece of data (|O) can be interpreted in many ways, and the meaning is unclear until someone clarifies the intent of the author.

Computers face the same problem. They store data, not abstract concepts, and do so using a sequence of 1's and 0's. Later, they read back the 1's and 0's and try to recreate the abstract concept from the raw data. Depending on the assumptions made, the 1's and 0's can mean very different things.

Why does this problem happen? Well, there's no rule that all computers must use the same language, just like there's no rule all humans need to. Each type of computer is internally consistent (it can read back its own data), but there are no guarantees about how another type of computer will interpret the data it created.

Basic concepts:
Data (bits and bytes, or marks on paper) is meaningless; it must be interpreted to create an abstract concept, like a number.
Like humans, computers have different ways to store the same abstract concept. (i.e., we have many ways to say "ten": ten, 10, diez, etc.)

Storing Numbers as Data

Thankfully, most computers agree on a few basic data formats (this was not always the case). This gives us a common starting point which makes our lives a bit easier:

A bit has two values (on or off, 1 or 0).
A byte is a sequence of 8 bits.
The "leftmost" bit in a byte is the biggest. So, the binary sequence 00001001 is the decimal number 9: 00001001 = 2^3 + 2^0 = 8 + 1 = 9.
Bits are numbered from right to left. Bit 0 is the rightmost and the smallest; bit 7 is the leftmost and largest.

We can use these basic agreements as a building block to exchange data. If we store and read data one byte at a time, it will work on any computer. The concept of a byte is the same on all machines, and the idea of "Byte 0" is the same on all machines.
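As a quick check of these agreements, a minimal sketch (an editor's addition, not the author's) that prints the bits of 00001001 from bit 7 down to bit 0:

#include <stdio.h>

int main(void)
{
    unsigned char byte = 0x09;  /* binary 00001001, decimal 9 */

    /* Bit 7 is the leftmost/largest, bit 0 the rightmost/smallest. */
    for (int bit = 7; bit >= 0; bit--)
        printf("%d", (byte >> bit) & 1);
    printf(" = %d\n", byte);    /* prints: 00001001 = 9 */
    return 0;
}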
Computers also agree on the order in which bytes are sent: they agree on which byte was sent first, second, third, etc., so "Byte 35" is the same on all machines.

So what's the problem? Computers agree on single bytes, right? Well, this is fine for single-byte data, like ASCII text. However, a lot of data needs to be stored using multiple bytes, like integers or floating-point numbers. And there is no agreement on how these sequences should be stored.

Byte Example

Consider a sequence of 4 bytes, named W, X, Y and Z. I avoided naming them A B C D because they are hex digits, which would be confusing. So, each byte has a value and is made up of 8 bits.

Byte Name:    W     X     Y     Z
Location:     0     1     2     3
Value (hex):  0x12  0x34  0x56  0x78

For example, W is an entire byte, 0x12 in hex or 00010010 in binary. If W were interpreted as a number, it would be 18 in decimal (by the way, there's nothing saying we have to interpret it as a number; it could be an ASCII character or something else entirely). With me so far? We have 4 bytes, W X Y and Z, each with a different value.

Understanding Pointers

Pointers are a key part of programming, especially the C programming language. A pointer is a number that references a memory location. It is up to us (the programmer) to interpret the data at that location. In C, when you cast (convert) a pointer to a certain type (such as char * or int *), it tells the computer how to interpret the data at that location. For example, let's declare:

void *p = 0;  // p is a pointer to an unknown data type
              // p is a NULL pointer -- do not dereference
char *c;      // c is a pointer to a single byte

12

Note that we can't get the data from p because we don't know its type. p could be pointing at a single number, a letter, the start of a string, your horoscope, an image -- we just don't know how many bytes to read, or how to interpret what's there. Now, suppose we write:

c = (char *)p;

Ah -- now this statement tells the computer to point at the same place as p, and interpret the data as a single character (1 byte). In this case, c would point to memory location 0, or byte W. If we printed c, we'd get the value in W, which is hex 0x12 (remember that W is a whole byte). This example does not depend on the type of computer we have -- again, all computers agree on what a single byte is (in the past this was not the case).

The example is helpful, even though it is the same on all computers -- if we have a pointer to a single byte (char *, a single byte), we can walk through memory, reading off a byte at a time. We can examine any memory location and the endian-ness of a computer won't matter -- every computer will give back the same information.

So, what's the problem? Problems happen when computers try to read multiple bytes. Some data types contain multiple bytes, like long integers or floating-point numbers. A single byte has only 256 values, so it can store 0 - 255. Now the problems start: when you read multi-byte data, where does the biggest byte appear?

Big endian machine: Stores data big-end first. When looking at multiple bytes, the first byte (lowest address) is the biggest.
Little endian machine: Stores data little-end first. When looking at multiple bytes, the first byte is smallest.

The naming makes sense, eh? Big-endian thinks the big end is first. (By the way, the big-endian / little-endian naming comes from Gulliver's Travels, where the Lilliputians argue over whether to break eggs on the little end or big end. Sometimes computer debates are just as meaningful.)

Again, endian-ness does not matter if you have a single byte. If you have one byte, it's the only data you read, so there's only one way to interpret it (again, because computers agree on what a byte is).

Now suppose we have our 4 bytes (W X Y Z) stored the same way on a big- and little-endian machine. That is, memory location 0 is W on both machines, memory location 1 is X, etc. We can create this arrangement by remembering that bytes are machine-independent. We can walk memory, one byte at a time, and set the values we need. This will work on any machine:

c = 0;       // point to location 0 (won't work on a real machine!)
*c = 0x12;   // set W's value
c = 1;       // point to location 1
*c = 0x34;   // set X's value
...          // repeat for Y and Z; details left to the reader

This code will work on any machine, and we have both set up with bytes W, X, Y and Z in locations 0, 1, 2 and 3.

Interpreting Data

Now let's do an example with multi-byte data (finally!). Quick review: a "short int" is a 2-byte (16-bit) number, which can range from 0 - 65535 (if unsigned). Let's use it in an example:

short *s;  // pointer to a short int (2 bytes)
s = 0;     // point to location 0; *s is the value

So, s is a pointer to a short, and is now looking at byte location 0 (which has W). What happens when we read the value at s?

Big endian machine: I think a short is two bytes, so I'll read them off: location s is address 0 (W, or 0x12) and location s + 1 is address 1 (X, or 0x34). Since the first byte is biggest (I'm big-endian!), the number must be 256 * byte 0 + byte 1, or 256*W + X, or 0x1234. I multiplied the first byte by 256 (2^8) because I needed to shift it over 8 bits.
Little endian machine: I don't know what Mr. Big Endian is smoking. Yeah, I agree a short is 2 bytes, and I'll read them off just like him: location s is 0x12, and location s + 1 is 0x34. But in my world, the first byte is the littlest! The value of the short is byte 0 + 256 * byte 1, or 256*X + W, or 0x3412.

Keep in mind that both machines start from location s and read memory going upwards. There is no confusion about what location 0 and location 1 mean. There is no confusion that a short is 2 bytes. But do you see the problem? The big-endian machine thinks s = 0x1234 and the little-endian machine thinks s = 0x3412. The same exact data gives two different numbers. Probably not a good thing.

Yet another example

Let's do another example with a 4-byte integer for "fun":

int *i;  // pointer to an int (4 bytes on a 32-bit machine)
i = 0;   // points to location zero, so *i is the value there

Again we ask: what is the value at i?

Big endian machine: An int is 4 bytes, and the first is the largest. I read 4 bytes (W X Y Z) and W is the largest. The number is 0x12345678.
Little endian machine: Sure, an int is 4 bytes, but the first is smallest. I also read W X Y Z, but W belongs way in the back -- it's the littlest. The number is 0x78563412.

Same data, different results -- not a good thing. (A runnable version of this example appears at the end of the article.)

The NUXI problem

Issues with byte order are sometimes called the NUXI problem: UNIX stored on a big-endian machine can show up as NUXI on a little-endian one. Suppose we want to store 4 bytes (U, N, I and X) as two shorts: UN and IX. Each letter is an entire byte, like our WXYZ example above. To store the two shorts we would write:

short *s;  // pointer to set shorts
s = 0;     // point to location 0
*s = UN;   // store first short: U * 256 + N (fictional code)
s = 2;     // point to next location
*s = IX;   // store second short: I * 256 + X

This code is not specific to a machine. If we store "UN" on a machine and ask to read it back, it had better be "UN"! I don't care about endian issues; if we store a value on one machine, we need to get the same value back. However, if we look at memory one byte at a time (using our char * trick), the order can vary. On a big endian machine we see:

13

Byte:      U  N  I  X
Location:  0  1  2  3

which makes sense: U is the biggest byte in "UN" and is stored first. The same goes for IX: I is the biggest, and stored first. On a little-endian machine we would see:

Byte:      N  U  X  I
Location:  0  1  2  3

And this makes sense too. "N" is the littlest byte in "UN" and is stored first. Again, even though the bytes are stored "backwards" in memory, the little-endian machine knows it is little-endian, and interprets them correctly when reading the values back. Also, note that we can specify hex numbers such as x = 0x1234 on any machine. Even a little-endian machine knows what you mean when you write 0x1234, and won't force you to swap the values yourself (you specify the hex number to write, and it figures out the details and swaps the bytes in memory, under the covers. Tricky.).

This scenario is called the "NUXI" problem because the byte sequence UNIX is interpreted as NUXI on the other type of machine. Again, this is only a problem if you exchange data -- each machine is internally consistent.

Exchanging Data Between Endian Machines

Computers are connected -- gone are the days when a machine only had to worry about reading its own data. Big- and little-endian machines need to talk and get along. How do they do this?

Solution 1: Use a common format

The easiest approach is to agree to a common format for sending data over the network. The standard network order is actually big-endian, but some people get uppity that little-endian didn't win... we'll just call it "network order". To convert data to network order, machines call a function hton (host-to-network). On a big-endian machine this won't actually do anything, but we won't talk about that here (the little-endians might get mad). But it is important to use hton before sending data, even if you are big-endian. Your program may be so popular it is compiled on different machines, and you want your code to be portable (don't you?).

Similarly, there is a function ntoh (network-to-host) used to read data off the network. You need this to make sure you are correctly interpreting the network data into the host's format. You need to know the type of data you are receiving to decode it properly. The conversion functions are:

htons() -- "Host to Network Short"
htonl() -- "Host to Network Long"
ntohs() -- "Network to Host Short"
ntohl() -- "Network to Host Long"

Remember that a single byte is a single byte, and order does not matter. These functions are critical when doing low-level networking, such as verifying the checksums in IP packets. If you don't understand endian issues correctly, your life will be painful -- take my word on this one. Use the translation functions, and know why they are needed.

Solution 2: Use a Byte Order Mark (BOM)

The other approach is to include a magic number, such as 0xFEFF, before every piece of data. If you read the magic number and it is 0xFEFF, it means the data is in the same format as your machine, and all is well. If you read the magic number and it is 0xFFFE (it is backwards), it means the data was written in a format different from your own. You'll have to translate it.

A few points to note. First, the number isn't really magic, but programmers often use the term to describe the choice of an arbitrary number (the BOM could have been any sequence of different bytes). It's called a byte-order mark because it indicates the byte order the data was stored in. Second, the BOM adds overhead to all data that is transmitted. Even if you are only sending 2 bytes of data, you need to include a 2-byte BOM.
Ouch! Unicode uses a BOM when storing multi-byte data (some Unicode character encodings can have 2, 3 or even 4 bytes per character). XML avoids this mess by storing data in UTF-8 by default, which stores Unicode information one byte at a time. And why is this cool? (Repeated for the 56th time.) Because endian issues don't matter for single bytes. Right you are.

Again, other problems can arise with a BOM. What if you forget to include the BOM? Do you assume the data was sent in the same format as your own? Do you read the data and see if it looks "backwards" (whatever that means) and try to translate it? What if regular data includes the BOM by coincidence? These situations are not fun.

Why are there endian issues at all? Can't we just get along?

Ah, what a philosophical question. Each byte-order system has its advantages. Little-endian machines let you read the lowest byte first, without reading the others. You can check whether a number is odd or even (the last bit is 0 if even) very easily, which is cool if you're into that kind of thing. Big-endian systems store data in memory the same way we humans think about data (left-to-right), which makes low-level debugging easier.

But why didn't everyone just agree to one system? Why do certain computers have to try and be different? Let me answer a question with a question: Why doesn't everyone speak the same language? Why are some languages written left-to-right, and others right-to-left? Sometimes communication systems develop independently, and later need to interact.

Epilogue: Parting Thoughts

Endian issues are an example of the general encoding problem: data needs to represent an abstract concept, and later the concept needs to be recreated from the data. This topic deserves its own article (or series), but you should now have a better understanding of endian issues.
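In place of the article's interactive example, here is a hedged, self-contained C version of the W X Y Z walkthrough: it plants the bytes 0x12 0x34 0x56 0x78 at increasing addresses and reinterprets them as a 16-bit and a 32-bit integer, printing whichever values the host's endianness produces.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    /* W X Y Z at locations 0..3, set one byte at a time, which works
       identically on any machine. */
    unsigned char mem[4] = { 0x12, 0x34, 0x56, 0x78 };

    uint16_t s;
    uint32_t i;
    memcpy(&s, mem, sizeof s);  /* read 2 bytes starting at location 0 */
    memcpy(&i, mem, sizeof i);  /* read 4 bytes starting at location 0 */

    /* Big-endian host:    s = 0x1234, i = 0x12345678
       Little-endian host: s = 0x3412, i = 0x78563412 */
    printf("short at 0: 0x%04X\n", (unsigned)s);
    printf("int   at 0: 0x%08X\n", (unsigned)i);
    return 0;
}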

