ARM Goes 64-bit

In the early 1980s, Reduced Instruction Set Computing (RISC) promised huge efficiencies to new instruction set architectures. In particular, the simpler ISAs enabled small teams to design very high performance processors, compared with existing complex instruction sets (collectively termed CISC). Within a short span of time, nearly a dozen RISC architectures were born, targeting similar markets. MIPS, SPARC, PA-RISC, POWER, and eventually Alpha all pursued high performance microprocessors for workstations and servers. But other RISC families focused on personal computers or embedded applications, such as AMD’s 29k and, most famously, ARM.
As an earlier article on ARM’s embedded success discussed, ARM was conceived in 1983 and originally intended for Acorn’s personal computers. The emphasis of the architecture was on fast handling of interrupts and I/O, which were judged to be essential for interactive use. While the ARM project was successful and the first chips were manufactured in 1985, Acorn Computers ultimately succumbed to the success of IBM-compatible PCs based on Microsoft Windows and Intel’s x86.
Fortunately, Apple and others had shown some interest and ARM was spun out from Acorn into a separate company in 1990. The result of the joint work from Apple and ARM was the ARM6, which was used in the Newton PDA, an early precursor to the iPad and iPhone. Later the ISA was licensed to Digital Equipment, which produced a high frequency design called StrongARM.
The most distinctive aspect of ARM is that the company has never sold a microprocessor. ARM only produces IP, which is ultimately embedded in the products of another company: first Acorn and the RiscPC, now the iPhone and the majority of smartphones. This was a fundamental departure from other early RISC families, where the target application was proprietary UNIX-based micro-computers and servers. SPARC, PA-RISC, MIPS, Alpha and POWER were all intended to yield microprocessors for internal uses. In contrast, ARM flourished as an embedded processor for low-end applications such as hard disk controllers. When Intel entered the server market with the Pentium Pro, it spelled a long slow death for many of the high-end RISC families, but was a non-event for ARM.
The current version of the ARM architecture is ARMv7, which encompasses three similar but not fully compatible profiles. The A-profile Application processors, including the A5, A7, A8, and A9, are all general purpose designs for low-power applications, such as smartphones. While floating point was initially optional, it has become a de facto requirement due to software compatibility issues. The R-profile targets Real-time uses with moderate performance requirements (e.g. disk controllers and baseband processors), but does not support virtual memory and floating point is still optional. The Microcontroller oriented M-profile is similar to the R-series, but with lower performance and cost.
Commonly, these ARM designs are purchased as IP and then incorporated into a System-on-Chip (SoC) by companies like Apple, Nvidia, and even Intel. However, in addition to ARM’s designs, there are several companies with architecture licenses. Instead of receiving an IP block, these licensees are free to create custom ARM-compatible designs to meet different requirements. While the identity of these architecture licensees is not necessarily disclosed, current licensees include Qualcomm (Snapdragon), Microsoft, Marvell, and Samsung.
Last year, ARM announced the 64-bit ARMv8 for Application processors. The new architecture is elegant, backwards compatible, and removes several crufty features from the existing ARMv7. Applied Micro simultaneously demonstrated an FPGA implementation of X-Gene, a custom 64-bit server processor based on ARMv8 that is expected in 2013. Applied Micro and others such as Microsoft helped ARM to define and shape the new architecture. More recently, Samsung and Cavium Networks have both taken architectural licenses and, along with start-up Calxeda, are expected to produce ARM-based server processors. As an aside, it is quite possible that ARM has developed an additional server profile, or these companies could be using the A-profile.
One of the main motivations for ARMv8 was memory addressing. The existing architecture was limited to a 4GB virtual address space, which is an uncomfortable constraint for systems with 2GB or more physical memory. The first round of 64-bit extensions was developed in the 1990s for server-oriented RISC families. In the early 2000s, the client-centric x86 bumped into the same virtual memory limitations and was extended to x86-64. Now, a decade after x86, ARM-based tablets routinely ship with 1-2GB of memory, approaching the practical limit. With 20 years of history, ARM had a tremendous opportunity to study different approaches and learn from the mistakes and successes of others.

Overview
The ARMv8 architecture is relatively elegant and compatible with ARMv7 for Application processors (i.e. A-profile) targeted at general purpose workloads. The most significant aspect of ARMv8 is the addition of a new 64-bit instruction set to complement the existing 32-bit ISA. The new instructions are known as A64 and operate on the AArch64 architectural state. ARMv8 also comprises the A32 and T32 (Thumb) instruction sets, which operate on the AArch32 architectural state and are backward compatible with ARMv7. However, the A64 instruction set is entirely separate and actually uses a slightly different format and new decoding tables.
ARMv7 is undeniably a RISC instruction set, but it has quite a few rather unattractive features that both clash with the ideals behind RISC and complicate real implementations. This is hardly surprising, as most RISC architectures had particular quirks that stemmed from history and ARM is no different. Many of the oddities in ARM are a result of the focus on embedded computing, e.g. banked registers for fast interrupt handling.
While ARMv8 is intended to be backwards compatible, A64 is moderately different from the existing 32-bit ARM architecture. A64 both adds capabilities and eliminates some of the more obnoxious aspects of the architecture. Some of the new enhancements in A64 are also applicable to A32, so software written for AArch32 that uses them is not compatible with ARMv7 implementations. Likewise, some instructions have been removed in AArch64 that may impact ARMv7 code, although these changes are unlikely to impact most software.
ARM targeted two data models for the 64-bit mode, to address the key OS partners. The first is LP64, where integers are 32-bit, and long integers are 64-bit, which is used by Linux, most UNIXes and OS X. The other is LLP64, where integers and long integers are 32-bit, while long long integers are 64-bit, and favored by Microsoft Windows.
One of the more substantial changes in AArch64 is the new exception and privilege model. AArch64 includes 4 Exception Levels (0-3), which replace the 8 different processor modes found in ARMv7. EL0 loosely corresponds to user-mode and EL1 to kernel-mode, with EL2 for hypervisors and EL3 for ARM’s TrustZone security monitor. EL3 is the most privileged, with EL0 as the least. The new privilege model is much simpler and relatively similar to existing approaches, such as the x86 rings.
As with other 64-bit ISAs, there is a high degree of inter-operability, as shown by Figure 1. AArch64 hypervisors and/or OSes can support AArch32 at lower privilege levels (e.g. an AArch32 guest OS on an AArch64 hypervisor or AArch32 apps on an AArch64 OS). However, higher privilege levels cannot be in AArch32 if lower levels are AArch64 (e.g. an AArch64 OS on an AArch32 hypervisor is not valid).
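The interoperability rule can be written down as a small check. The Python sketch below is purely illustrative: the function name `is_valid_config` and the list-of-widths representation are assumptions made for this example and do not come from the ARM specification.

```python
def is_valid_config(widths):
    """Check a hypothetical assignment of execution states to exception levels.

    widths[i] is 32 (AArch32) or 64 (AArch64) for exception level ELi,
    with EL0 first. The rule from the text: a more privileged level cannot
    run in AArch32 if any less privileged level runs in AArch64.
    """
    if any(w not in (32, 64) for w in widths):
        raise ValueError("each level must be 32 or 64")
    for lower in range(len(widths)):
        for higher in range(lower + 1, len(widths)):
            if widths[lower] == 64 and widths[higher] == 32:
                return False
    return True
```

For instance, `is_valid_config([32, 64, 64, 64])` models AArch32 applications on an AArch64 OS and is accepted, while `is_valid_config([64, 64, 32, 64])` models an AArch64 OS on an AArch32 hypervisor and is rejected.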
Transitions between AArch32 and AArch64 can only occur at exceptions and exception returns. For backwards compatibility, AArch32 still retains the rather complicated ARMv7 exception model which features 8 different privilege modes.
ARM’s approach of separate ISAs for compatibility is a contrast to Intel and AMD’s solution with x86, where the newer modes are truly an extension of the existing architecture. In x86, older 16-bit and 32-bit instructions are essentially a subset of the newer 32-bit and 64-bit operating modes. An extended ISA allows mixing new and old code more seamlessly. However, separate ISAs are necessary to eliminate the older exception model in ARMv7 and move to the newer and more elegant system in AArch64.
The one surprise in ARMv8 is the omission of any explicit support for multi-threading. Nearly every other major architecture (x86, MIPS, SPARC, and Power) has support for multi-threading and at least one or two multi-threaded implementations. While initially billed as a technique for servers, it is helpful for a variety of software. However, multi-threading is very difficult to validate, and the engineers at ARM may have simply felt that handling the transition to a 64-bit architecture was challenging enough. Certainly, should the lack of multi-threading become a significant competitive disadvantage, it can be added in the future.

General Register State

AArch32 was not particularly regular and one of the biggest complications was the relationship between the registers and exception modes. AArch32 includes 13 general registers (R0-12), the Program Counter (R15) and 2 banked registers that contain the Stack Pointer (R13) and Link Register (R14). The user and system modes share these 16 registers and a Program Status Register (PSR). The fast interrupt (FIQ) mode shares R0-7 and the PC, with its own private R8-14 and Saved PSR. All other exception modes have private banked registers and Saved PSRs. This complicated register banking was one of the techniques originally used to reduce the latency for exceptions, which made ARM particularly suitable for embedded controllers. However, this has the drawback of requiring >40 registers, of which less than half can be used simultaneously – a clear problem from the standpoint of power and area efficiency.
Like x86, ARM took the opportunity to extend, expand and simplify the architectural registers. Naturally, the new GPRs are all 64 bits wide to handle larger addresses. 32-bit accesses use the lower half of registers and either ignore or zero out the upper half. There are more GPRs, and the banking is reduced to 4 different levels. There are 30 GPRs (X0-29), a Procedure Link Register (X30), and X31 acts as a hardwired zero register. Unlike A32, the PC is a special named register that can only be used for explicit control flow instructions and certain addressing modes. Additionally, each of the 4 privilege levels has 3 private banked registers: the Exception Link Register, Stack Pointer and Saved PSR. The AArch32 registers map onto the lower half of the AArch64 registers, which enables running AArch32 on top of AArch64.
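The behavior of 32-bit accesses can be illustrated with a toy register file. This Python sketch assumes that reads ignore the upper half and writes zero it, which is one reading of the sentence above; the class and method names are hypothetical.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


class RegisterFile:
    """Toy model of the 31 general 64-bit registers (X0-X30)."""

    def __init__(self):
        self.x = [0] * 31

    def write64(self, n, value):
        # A full-width write replaces all 64 bits.
        self.x[n] = value & MASK64

    def write32(self, n, value):
        # Assumed behavior: a 32-bit write keeps the low half of the
        # value and zeroes the upper half of the register.
        self.x[n] = value & MASK32

    def read32(self, n):
        # A 32-bit read simply ignores the upper half.
        return self.x[n] & MASK32
```

Zeroing the upper half on a write means the result never depends on the stale upper bits, so a 32-bit operation does not need to merge with the previous register contents.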
Vector Register State

As with most popular architectures, ARMv7 has scalar floating point (VFP) and vector extensions with integer and floating point data (NEON, also known as Advanced SIMD). In ARMv7, these two extensions share a single register file. Both VFP and SIMD are carried over to AArch64, along with the shared register file. However, the two extensions are a standard part of ARMv8, whereas they were optional in some ARMv7 implementations.
Previously, there were 32 vector registers, each 64 bits wide. Pairs of adjacent registers were aliased to provide 16 virtual 128-bit registers for the SIMD instructions. Leaving no stone unturned, ARM’s architects took the opportunity to tweak this arrangement. In ARMv8, all 32 vector registers (V0-V31) are extended to 128 bits, doubling the capacity. Instead of using pairs of smaller registers to form larger virtual registers, the lower half of these 128-bit registers alias to the existing 64-bit registers. As with the GPRs, partial accesses will either ignore or zero out the upper half of a vector register.
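The two aliasing schemes can be contrasted in a few lines of Python. This is a sketch with hypothetical helper names; it assumes that in the ARMv7 scheme the even-numbered register of each pair supplies the low 64 bits.

```python
MASK64 = (1 << 64) - 1


def armv7_q_from_d(d_regs, n):
    """ARMv7 view: the 128-bit Qn is formed from the adjacent 64-bit pair
    D(2n) (assumed low half) and D(2n+1) (assumed high half)."""
    low, high = d_regs[2 * n], d_regs[2 * n + 1]
    return (high << 64) | low


def armv8_d_from_v(v_regs, n):
    """ARMv8 view: the 64-bit Dn is simply the lower half of the 128-bit Vn."""
    return v_regs[n] & MASK64
```

In the first scheme 32 narrow registers yield only 16 wide ones; in the second, every one of the 32 registers has both a 64-bit and a 128-bit view.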
ARMv8 Instruction Set Changes

As is to be expected, the most substantive changes in the A64 ISA are the memory model and related instructions. The A64 instruction set that operates on AArch64 is largely similar to the existing ISA, but without various idiosyncrasies that are problematic for modern microprocessors. As with A32, instructions are fixed length, requiring 32 bits to specify as many as 3 operands.
Unlike ARMv7 though, there is currently no support for a 64-bit version of Thumb to improve instruction density. One challenge to a potential T64 instruction set is that larger addresses and branch offsets will crowd the instruction format. A64 already decreased the offset range for conditional branches to +/-1MB (from +/-32MB in ARMv7), and further reductions would be rather painful.
All instructions in A32 were conditional, using predication. However, predication uses bits in the instruction encoding that are already at a premium due to doubling the number of registers in AArch64. Moreover, predication complicates out-of-order execution since it adds an extra input that must be renamed. In A64, the only conditional instructions are branch, comparison and select. While this will slightly increase code size, the simplification is worthwhile.
Similarly, A32 had an in-line shifter that could be used for ‘free’ with nearly every integer instruction. However, shifts are notoriously difficult to implement at high frequency due to the complicated wiring. An implicit shift in every instruction will increase the length of pipeline stages and reduce the overall frequency. A64 instructions can apply a very limited shift to the destination register, and there are new instructions to handle more complicated cases such as variable shifts.
Most of the changes to VFP in A64 are relatively minor. VFP is intended for single precision and double precision scalar computation. There are new instructions to satisfy the IEEE754-2008 standard, particularly calculating the min and max of two numbers. Floating point comparisons now set the integer condition flags, rather than the flags in the FP Status Register. There are also new conversion instructions between various FP formats, and also the new 64-bit integers.
The advanced SIMD is the vector counterpart to VFP and has been more aggressively enhanced. In particular, the vector instructions in ARMv7 could operate on integer, single precision and rarely polynomial data. In A64, the vector elements also include double precision floating point and have full IEEE support with the required rounding modes, and handling of denormals and NaNs.
Similar to SSE or AVX, the advanced SIMD instructions are variable length vectors that depend on the size of the registers and data types. The underlying registers are either 64-bit or 128-bit and can pack from 1-16 elements. The integer data types are mostly unchanged, spanning from a single byte to 64-bits. Floating point data can be stored in half-precision, but operations are all single precision or double precision.
With the move to 128-bit registers, A64 includes new instructions for inserting and extracting vector elements. There are also three cross-lane instructions for vector reductions, specifically summing and taking the minimum or maximum value.
Several existing instructions including comparison, add, absolute value and negation have been extended to operate on 64-bit integer elements. There are also new instructions for data type conversion, floating point normalization and saturating integer arithmetic. Lastly, ARMv8 includes a variety of optional cryptographic instructions, that are intended to complement existing hardware accelerators. ARM opted to focus on AES encryption, the SHA1 and SHA256 hashing algorithms and Galois fields with 16 new instructions.
Virtual Address Space

The most prominent and visible aspect of ARMv8 and A64 is extending the virtual addressing, but there are many other improvements to the memory architecture. Currently, AArch64 features two 48-bit virtual address spaces, one for the kernel and one for applications. Application addressing starts at 0 and grows upwards, while kernel space grows down from 2^64; any references to unmapped addresses in between will trigger a fault. Pointers are sign extended to 64 bits, and can optionally be configured to use the upper 8 bits for tagging pointers with additional information.
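The split between the two 48-bit ranges can be sketched as a simple classifier. The following Python is illustrative only: the constant and function names are hypothetical, and pointer tagging is deliberately left out.

```python
VA_BITS = 48
USER_TOP = (1 << VA_BITS) - 1             # 0x0000_FFFF_FFFF_FFFF
KERNEL_BASE = (1 << 64) - (1 << VA_BITS)  # 0xFFFF_0000_0000_0000


def classify(addr):
    """Return which region a 64-bit virtual address falls into."""
    if not 0 <= addr < (1 << 64):
        raise ValueError("address must fit in 64 bits")
    if addr <= USER_TOP:
        return "user"    # application space, growing up from 0
    if addr >= KERNEL_BASE:
        return "kernel"  # kernel space, growing down from 2^64
    return "fault"       # the unmapped hole between the two ranges
```

Sign extension falls out naturally: a valid 48-bit address has its upper 16 bits equal to all zeros (application) or all ones (kernel), and everything else lands in the hole.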
The translation tables for each virtual address space are mapped using either traditional 4KB pages or a new larger 64KB page. The minimum page size determines which page table format will be used. The 64KB pages can improve performance, at the cost of substantially increasing memory fragmentation and utilization.
For virtual address spaces with 4KB pages, a 4-level table is used with 9 bits translated per lookup. In this case, 64KB pages will help index more data, but not reduce the number of lookups. When using 64KB pages, each lookup provides 13 address bits and only 3 levels are necessary. In fact, if addresses smaller than 42 bits are used, the 64KB page tables only need two lookups for translation.
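The level counts above follow from simple arithmetic. The Python sketch below assumes that each table occupies exactly one page and that each entry is 8 bytes; neither detail is stated in the text, but together they reproduce the 9 and 13 bits per lookup quoted above. The function names are hypothetical.

```python
def bits_per_level(page_size, entry_size=8):
    """Address bits resolved by one lookup, assuming a table fills one page
    and entries are entry_size bytes (an assumption, not from the article)."""
    entries = page_size // entry_size
    return entries.bit_length() - 1  # log2 for a power of two


def levels_needed(va_bits, page_size):
    """Number of table lookups needed to translate va_bits of address."""
    offset_bits = page_size.bit_length() - 1  # bits covered by the page offset
    remaining = va_bits - offset_bits
    per_level = bits_per_level(page_size)
    return -(-remaining // per_level)         # ceiling division
```

With these assumptions, 4KB pages give 12 + 4 × 9 = 48 bits over four levels, 64KB pages cover 48 bits in three levels, and 16 + 2 × 13 = 42 bits is exactly what two levels of 64KB tables can reach.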
Addressing and Memory Instructions

Like all RISCs, ARM is a strict load/store instruction set that separates memory accesses from arithmetic. ARMv7 and A32 have a single relatively nice indexed addressing mode with optional pre- and post-incrementing. A base register is added to a scaled (i.e. shifted) offset (either another register or an immediate). Optionally, the offset pre- or post-updates the base register, which is useful for handling loops. Since the PC is an ordinary register in AArch32, the indexed addressing mode can be used for PC-relative addressing as well. Since ARMv6, unaligned memory accesses have been supported for single loads or stores.
A64 addressing is generally similar in terms of capabilities, but has been adapted to simplify address calculation. There are two separate addressing modes, the familiar indexed mode and a new PC-relative mode, since the PC cannot be accessed like a regular register. As with ARMv7, unaligned accesses are allowed, but have a performance penalty.
The indexed addressing is still robust, but the incrementing is limited to simplify the critical path in address generation. The 64-bit base register is added to a scaled offset. The offset can be an immediate, a 64-bit register or a sign-extended 32-bit register. However, pre-incrementing is only available with unscaled immediate offsets. Any load or store can post-increment with an unscaled immediate, but only SIMD loads and stores can use post-increment with a register offset. The immediates are generally limited to 9 bits, signed. However, for base plus offset a scaled 12-bit unsigned immediate is available.
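To make the two immediate forms concrete, here is a small Python sketch of the range checks. The helper names are hypothetical, and the assumption that the 12-bit immediate is scaled by the access size in bytes is an interpretation of "scaled" rather than something the text spells out.

```python
MASK64 = (1 << 64) - 1


def fits_unscaled_imm9(offset):
    """9-bit signed byte offset: -256 to +255."""
    return -256 <= offset <= 255


def fits_scaled_uimm12(offset, access_size):
    """12-bit unsigned offset, assumed to be in units of the access size."""
    if offset % access_size != 0:
        return False
    return 0 <= offset // access_size <= 4095


def effective_address(base, offset):
    """Base register plus byte offset, wrapped to 64 bits."""
    return (base + offset) & MASK64
```

Under these assumptions an 8-byte load can reach byte offsets up to 4095 × 8 = 32760 with the scaled form, while negative or misaligned offsets have to fit in the 9-bit unscaled form.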
A new literal addressing mode is used to calculate PC-relative addresses when accessing at least 32-bits of data. Literal addressing replaces the base register with the PC and adds a 19-bit signed offset, giving a relatively limited range of +/-1MB. While this preserves some PC-relative capabilities, it is significantly less flexible than addressing in AArch32.
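The +/-1MB figure can be checked with a line or two of arithmetic. The sketch below assumes the 19-bit offset counts 4-byte words, which is the scaling needed to turn a 19-bit signed field into roughly 1MB in each direction; the function name is hypothetical.

```python
def literal_range(imm_bits=19, scale=4):
    """Byte range reachable by a signed immediate of imm_bits,
    assumed to be scaled by `scale` bytes per unit."""
    lowest = -(1 << (imm_bits - 1)) * scale
    highest = ((1 << (imm_bits - 1)) - 1) * scale
    return lowest, highest
```

With the defaults this gives a reach of 1048576 bytes backwards and 1048572 bytes forwards relative to the PC.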
ARMv7 included several instructions that could access multiple memory locations. In particular, load multiple and pop can read all of the registers from memory, while store multiple and push can write all the registers to memory. These instructions must be micro-coded to handle mis-speculation, exceptions and interrupts and are a potential source of complexity. AArch64 eliminates all multiple memory access instructions to simplify the microarchitecture at the cost of instruction density. To accelerate multiple accesses, two new instructions, load pair and store pair, have been added.
Load and store pair access a pair of independent registers from adjacent memory locations with unaligned support. However, the addressing modes are somewhat more limited than normal accesses. Specifically, the pair access instructions can only use the base register plus a scaled 7-bit signed immediate, with optional pre- and post-increment. The pair instructions are clever, since only a single address calculation is needed, saving a little power. The pair instructions are also available as non-temporal accesses, although with only base plus immediate addressing.
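The reach of the pair instructions follows from the 7-bit field. This Python sketch assumes the immediate is scaled by the size of one register in the pair; the function names are hypothetical and the encoding shown is only the two's complement field, not a full instruction.

```python
def pair_offset_range(reg_bytes):
    """Byte offsets reachable with a 7-bit signed immediate
    (-64 to +63), assumed to be scaled by the register size."""
    return -64 * reg_bytes, 63 * reg_bytes


def encode_pair_offset(offset, reg_bytes):
    """Return the 7-bit two's complement field for a byte offset."""
    if offset % reg_bytes != 0:
        raise ValueError("offset must be a multiple of the register size")
    imm = offset // reg_bytes
    if not -64 <= imm <= 63:
        raise ValueError("offset out of range for a 7-bit immediate")
    return imm & 0x7F
```

For a pair of 64-bit registers this gives a window from -512 to +504 bytes around the base, which is enough for typical stack frame saves and restores.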
Memory Ordering Model

As part of defining ARMv8, the architects paid careful attention to defining a clean memory model. This is particularly crucial for an architecture which will have many different teams working on implementations, since memory ordering is responsible for the most complex and difficult bugs in both hardware and software.
ARMv8 has a Release Consistency memory model, which is relatively weak. It is very similar to the Itanium memory model, and aligns well with C++11. This choice was motivated primarily by power efficiency. Generally, weak ordering models are more difficult to program, because there are few guarantees. However, weak ordering can also reduce the buffering that is required for in-flight loads and stores in a multi-processor system and reduce power consumption.
In the ARMv8 memory model, an aligned memory access that targets a single GPR is guaranteed to be atomic. Load pair and store pair instructions are guaranteed to appear as two individual atomic accesses, if targeting GPRs and naturally aligned. Unaligned accesses are not atomic, and as a practical matter are likely to be split into at least two accesses and a shift. Moreover, vector memory accesses (whether SIMD or scalar FP) are not guaranteed to be atomic at all. To allow programmers to write concurrent software, a number of synchronization primitives are available.
ARMv7 and v8 feature three different types of barriers: a Data Synchronization Barrier (DSB), Data Memory Barrier (DMB), and an Instruction Synchronization Barrier (ISB). A DSB stalls the processor until all pending loads and stores have completed. A DMB forces all earlier (in program order) memory accesses to become globally visible before any subsequent accesses. An ISB flushes the CPU pipeline and any prefetch buffers, forcing any subsequent instructions to be fetched from cache or memory. Since ARM does not have coherent instruction caching, this is necessary (but not sufficient) for modifying instructions in memory.
ARMv7 and v8 also incorporate exclusive (or atomic) memory accesses, which are sometimes described as a load-linked and store-conditional (LL/SC). The load-linked instruction will read a value from an address in memory, and then the store-conditional will write a new value to the same address in memory if no other writes to the address have occurred. LL/SC is quite useful for constructing other synchronization primitives such as spinlocks. The LL/SC can be combined with pair instructions to atomically update a location that spans two registers.
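The LL/SC pattern is easiest to see in a toy model. The Python sketch below is a single-threaded simulation with hypothetical names; it tracks one reservation per CPU and clears it on any write to the reserved address, which is a simplification of what real hardware monitors do.

```python
class ExclusiveMemory:
    """Toy, single-threaded model of load-linked / store-conditional."""

    def __init__(self):
        self.mem = {}
        self.reservations = {}  # cpu id -> reserved address

    def load_linked(self, cpu, addr):
        # Read the value and remember that this CPU is watching addr.
        self.reservations[cpu] = addr
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        # Any write to addr invalidates every reservation on it.
        self.mem[addr] = value
        for cpu in [c for c, a in self.reservations.items() if a == addr]:
            del self.reservations[cpu]

    def store_conditional(self, cpu, addr, value):
        # Succeeds only if no write has hit addr since the load-linked.
        if self.reservations.get(cpu) != addr:
            return False
        self.store(addr, value)
        return True


def atomic_increment(memory, cpu, addr):
    """Retry loop: the usual way LL/SC builds a read-modify-write."""
    while True:
        old = memory.load_linked(cpu, addr)
        if memory.store_conditional(cpu, addr, old + 1):
            return old + 1
```

A spinlock follows the same shape: load-linked the lock word, and if it is free, store-conditional the "held" value, retrying whenever the conditional store reports failure.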
ARMv8 introduces the new and elegant one-sided fences associated with Release Consistency: load-acquire and store-release. Unlike the barriers in ARMv7, these fences are address-based synchronization primitives. A load-acquire guarantees that any later (in program order) memory accesses will only be visible after the load-acquire. A store-release guarantees that all earlier memory accesses will be visible before the store-release becomes visible. Moreover, the store-release becomes visible to all caching agents in the system simultaneously. The two can also be combined to form a full fence: a store-release and a load-acquire will be globally visible in program order.
The address-based synchronization primitives, load-acquire/store-release and LL/SC are all limited to only use base register addressing, with no offsets, indexing or increments, which simplifies the implementation.
Conclusion

The ARMv8 architecture is classically British: a clean and elegant 64-bit instruction set, with backwards compatibility for existing 32-bit software. The new AArch64 is certainly an improvement over ARMv7, with many improvements above and beyond simply extending the virtual address space to 48 bits.
The most notable additions in ARMv8 are the larger and highly regular integer register file, double precision vectors with IEEE support, and new synchronization primitives with a well-defined memory ordering model. In some respects though, the more significant changes came not from adding features, but removing them.
Like x86, ARMv7 had a fair bit of cruft, and the architects took care to remove many of the byzantine aspects of the instruction set that were difficult to implement. The peculiar interrupt modes and banked registers are mostly gone. Predication and implicit shift operations have been dramatically curtailed. The load/store multiple instructions have also been eliminated, replaced with load/store pair. Collectively, these changes make AArch64 potentially more efficient than ARMv7 and easier to implement in modern process technology.
There are no ARMv8 implementations available to judge the merits of the architecture in practice. But overall, ARMv8 is clearly a sound design that was well thought out and should enable reasonable implementations.
The vast majority of companies will wait for a licensable core design from ARM. However, those with the resources and expertise to design a CPU core will forge ahead and should have a time to market advantage and a potential differentiating factor. Applied Micro should be first to market, but others will swiftly follow, including Cavium Networks, Qualcomm, Samsung, and Nvidia.
Certainly, the next few years should prove very interesting. The number of ARMv8 architecture licensees looks set to grow, which should inject some additional diversity into the industry. However, it is unclear whether the market is large enough to support so many companies in the long term. Future ARMv8 cores will undoubtedly be found in Apple’s iPhone and iPad, along with Android devices from TI, Samsung, and others. The real question is whether ARMv8 will enable ARM’s partners to move up the value chain to servers and notebooks. However, that requires competing with Intel, which has a massive advantage in process technology over the rest of the industry.
As an earlier article on ARM’s embedded success discussed, ARM was conceived in 1983 and originally intended for Acorn’s personal computers. The emphasis of the architecture was on fast handling of interrupts and I/O, which were judged to be essential for interactive use. While the ARM project was successful and the first chips were manufactured in 1985, Acorn Computers ultimately succumbed to the success of IBM-compatible PCs based on Microsoft Windows and Intel’s x86.
Fortunately, Apple and others had shown some interest and ARM was spun out from Acorn into a separate company in 1990. The result of the joint work from Apple and ARM was the ARM6, which was used in the Newton PDA, an early precursor to the iPad and iPhone. Later the ISA was licensed to Digital Equipment, which produced a high frequency design called StrongARM.
The most unique aspect of ARM is that the company has never sold a microprocessor. ARM only produces IP, which is ultimately embedded in the products of another company – first Acorn and the RiscPC, now the iPhone and the majority of smartphones. This was a fundamental departure from other early RISC families, where the target application was proprietary UNIX-based micro-computers and servers. SPARC, PA-RISC, MIPS, Alpha and POWER were all intended to yield microprocessors for internal uses. In contrast, ARM flourished as an embedded processor for low-end applications such as hard disk controllers. When Intel entered the server market with the Pentium Pro, it spelled a long slow death for many of the high-end RISC families, but was a non-event for ARM.
The current version of the ARM architecture is ARMv7, which encompasses three similar but not fully compatible profiles. The A-profile Application processors, including the A5, A7, A8, and A9, are all general purpose designs for low-power applications, such as smartphones. While floating point was initially optional, it has become a de facto requirement due to software compatibility issues. The R-profile targets Real-time uses with moderate performance requirements (e.g. disk controllers and baseband processors), but does not support virtual memory and floating point is still optional. The Microcontroller oriented M-profile is similar to the R-series, but with lower performance and cost.
Commonly, these ARM designs are purchased as IP and then incorporated into a System-on-Chip (SoC) by companies like Apple, Nvidia, and even Intel. However, in addition to ARM’s designs, there are several companies with architecture licenses. Instead of receiving an IP block, these licensees are free to create custom ARM-compatible designs to meet different requirements. While the identity of these architecture licensees is not necessarily disclosed, current licensees include Qualcomm (Snapdragon), Microsoft, Marvell, and Samsung.
Last year, ARM announced the 64-bit ARMv8 for Application processors. The new architecture is elegant, backwards compatible, and removes several crufty features from the existing ARMv7. Applied Micro simultaneously demonstrated an FPGA implementation of X-Gene, a custom 64-bit server processor based on ARMv8 that is expected in 2013. Applied Micro and others such as Microsoft helped ARM to define and shape the new architecture. More recently, Samsung and Cavium Networks have both taken architectural licenses and along with start up Calxeda, are expected to produce ARM-based server processors. As an aside, it is quite possible that ARM has developed an additional server profile, or these companies could be using the A-profile.
One of the main motivations for ARMv8 was memory addressing. The existing architecture was limited to a 4GB virtual address space, which is an uncomfortable constraint for systems with 2GB or more physical memory. The first round of 64-bit extensions were developed in the 1990′s for server-oriented RISC families. In the early 2000′s, the client-centric x86 bumped into the same virtual memory limitations and was extended to x86-64. Now a decade after x86, ARM-based tablets routinely ship with 1-2GB of memory, approaching the practical limit. With 20 years of history, ARM had a tremendous opportunity to study different approaches and learn from the mistakes and successes of others.

Overview
The ARMv8 architecture is a relatively elegant and compatible with ARMv7 for Application processors (i.e. A-profile) targeted at general purpose workloads. The most significant aspect of ARMv8 is the addition of a new 64-bit instruction set to complement the existing 32-bit ISA. The new instructions are known as A64 and operate on the AArch64 architectural state. ARMv8 also comprises A32 and T32 (for Thumb) and AArch32, which are backward compatible with ARMv7. However, the A64 instruction set is entirely separate and actually uses a slightly different format and new decoding tables.
ARMv7 is undeniably a RISC instruction set, but it has quite a few rather unattractive features that both clash with the ideals behind RISC and complicate real implementations. This is hardly surprising, as most RISC architectures had particular quirks that stemmed from history and ARM is no different. Many of the oddities in ARM are a result of the focus on embedded computing, e.g. banked registers for fast interrupt handling.
While ARMv8 is intended to be backwards compatible, A64 is moderately different from the existing 32-bit ARM architecture. A64 both adds capabilities and eliminates some of the more obnoxious aspects of the architecture. Some of the new enhancements in A64 are also applicable to A32, so software written for AArch32 that uses them is not compatible with ARMv7 implementations. Likewise, some instructions have been removed in AArch64 that may impact ARMv7 code, although these changes are unlikely to impact most software.
ARM targeted two data models for the 64-bit mode, to address the key OS partners. The first is LP64, where integers are 32-bit, and long integers are 64-bit, which is used by Linux, most UNIXes and OS X. The other is LLP64, where integers and long integers are 32-bit, while long long integers are 64-bit, and favored by Microsoft Windows.
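As a rough illustration, the two data models can be written down as a small Python table. The type widths come from the description above; the 64-bit pointer entry and the helper name `fits_pointer` are assumptions added for the example.

```python
# Widths in bits of the C integer types under each data model.
# The "pointer" entries are an assumption added for illustration; the text
# above only lists the integer types.
DATA_MODELS = {
    "LP64":  {"int": 32, "long": 64, "long long": 64, "pointer": 64},
    "LLP64": {"int": 32, "long": 32, "long long": 64, "pointer": 64},
}


def fits_pointer(model: str, c_type: str) -> bool:
    """Return True if a pointer can be stored in c_type without truncation."""
    widths = DATA_MODELS[model]
    return widths[c_type] >= widths["pointer"]
```

The practical consequence is the familiar porting hazard: code that stores a pointer in a `long` works under LP64 but truncates under LLP64, where only `long long` is wide enough.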
One of the more substantial changes in AArch64 is the new exception and privilege model. AArch64 includes 4 Exception Levels (0-3), which replace the 8 different processor modes found in ARMv7. EL0 loosely corresponds to user-mode, EL1 to kernel-mode with EL2 for hypervisors and EL3 for ARM’s TrustZone security monitor. EL3 is the most privileged, with EL0 as the least. The new privilege model is much simpler and relatively similar to existing approaches, such as x86 rings.
As with other 64-bit ISAs, there is a high degree of inter-operability, as shown by Figure 1. AArch64 hypervisors and/or OSes can support AArch32 at lower privilege levels (e.g. AArch32 guest OS on an AArch64 hypervisor or AArch32 apps on an AArch64 OS). However, higher privilege levels cannot be in AArch32 if lower levels are AArch64 (e.g. AArch64 OS on AArch32 hypervisor is not valid).
Transitions between AArch32 and AArch64 can only occur at exceptions and exception returns. For backwards compatibility, AArch32 still retains the rather complicated ARMv7 exception model which features 8 different privilege modes.
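The nesting rule for execution states can be sketched as a short check. This is only a model of the rule as stated here, not of any real configuration register; the function name and the list representation are hypothetical.

```python
def valid_configuration(states):
    """Check a list of execution states, ordered from EL0 up to EL3.

    Each entry is "AArch64" or "AArch32". Rule modelled: a more privileged
    level may not be AArch32 if any less privileged level is AArch64.
    """
    seen_64_below = False
    for state in states:                 # walk from EL0 upwards
        if state not in ("AArch64", "AArch32"):
            raise ValueError(f"unknown execution state: {state}")
        if state == "AArch64":
            seen_64_below = True
        elif seen_64_below:              # AArch32 sitting above an AArch64 level
            return False
    return True
```

For example, `["AArch32", "AArch64", "AArch64", "AArch64"]` (a 32-bit application on a 64-bit OS) is accepted, while `["AArch64", "AArch64", "AArch32", "AArch64"]` (a 64-bit OS under a 32-bit hypervisor) is rejected.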
ARM’s approach of separate ISAs for compatibility is a contrast to Intel and AMD’s solution with x86, where the newer modes are truly an extension of the existing architecture. In x86, older 16-bit and 32-bit instructions are essentially a subset of the newer 32-bit and 64-bit operating modes. An extended ISA allows mixing new and old code more seamlessly. However, separate ISAs are necessary to eliminate the older exception model in ARMv7 and move to the newer and more elegant system in AArch64.
The one surprise in ARMv8 is the omission of any explicit support for multi-threading. Nearly every other major architecture (x86, MIPS, SPARC, and Power) has support for multi-threading and at least one or two multi-threaded implementations. While initially billed as a technique for servers, it is helpful for a variety of software. However, multi-threading is very difficult to validate, and the engineers at ARM may have simply felt that handling the transition to a 64-bit architecture was challenging enough. Certainly, should the lack of multi-threading become a significant competitive disadvantage, it can be added in the future.

General Register State

AArch32 was not particularly regular and one of the biggest complications was the relationship between the registers and exception modes. AArch32 includes 13 general registers (R0-12), the Program Counter (R15) and 2 banked registers that contain the Stack Pointer (R13) and Link Register (R14). The user and system modes share these 16 registers and a Program Status Register (PSR). The fast interrupt (FIQ) mode shares R0-7 and the PC, with its own private R8-14 and Saved PSR. All other exception modes have private banked registers and Saved PSRs. This complicated register banking was one of the techniques originally used to reduce the latency for exceptions, which made ARM particularly suitable for embedded controllers. However, this has the drawback of requiring more than 40 registers, of which fewer than half can be used simultaneously, a clear problem from the standpoint of power and area efficiency.
Like x86, ARM took the opportunity to extend, expand and simplify the architectural registers. Naturally, the new GPRs are all 64 bits wide to handle larger addresses. 32-bit accesses use the lower half of registers and either ignore or zero out the upper half. There are more GPRs, and the banking is reduced to 4 different levels. There are 30 GPRs (X0-29), a Procedure Link Register (X30), and X31 acts as a hardwired zero register. Unlike A32, the PC is a special named register that can only be used for explicit control flow instructions and certain addressing modes. Additionally, each of the 4 privilege levels has 3 private banked registers: the Exception Link Register, Stack Pointer and Saved PSR. The AArch32 registers map onto the lower half of the AArch64 registers, which enables running AArch32 on top of AArch64.
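The behaviour of 32-bit accesses to a 64-bit register can be modelled in a few lines. This is a toy sketch with a hypothetical `Register64` class; it assumes that reads ignore the upper half and that writes zero it, which is one reading of the description above.

```python
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


class Register64:
    """Toy model of one 64-bit general purpose register (hypothetical class)."""

    def __init__(self, value=0):
        self.x = value & MASK64          # full 64-bit view

    def read32(self):
        # A 32-bit read ignores the upper half of the register.
        return self.x & MASK32

    def write32(self, value):
        # A 32-bit write keeps the low 32 bits of the result and
        # zeroes the upper half (assumed behaviour, see lead-in).
        self.x = value & MASK32
```

Zeroing rather than preserving the upper half means a 32-bit write never depends on the register's previous contents, which avoids a false dependency in an out-of-order pipeline.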
Vector Register State

As with most popular architectures, ARMv7 has scalar floating point (VFP) and vector extensions with integer and floating point data (NEON, also known as Advanced SIMD). In ARMv7, these two extensions share a single register file. Both VFP and SIMD are carried over to AArch64, along with the shared register file. However, the two extensions are a standard part of ARMv8, whereas they were optional in some ARMv7 implementations.
Previously, there were 32 vector registers, each 64 bits wide. Pairs of adjacent registers were aliased to provide 16 virtual 128-bit registers for the SIMD instructions. Leaving no stone unturned, ARM’s architects took the opportunity to tweak this arrangement. In ARMv8, all 32 vector registers (V0-V31) are extended to 128 bits, doubling the capacity. Instead of using pairs of smaller registers to form larger virtual registers, the lower half of these 128-bit registers alias to the existing 64-bit registers. As with the GPRs, partial accesses will either ignore or zero out the upper half of a vector register.
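The two aliasing schemes can be contrasted with a small sketch. The function names are hypothetical, and the D/Q labels follow the usual ARM naming for the 64-bit and 128-bit views, which the text above does not spell out.

```python
def armv7_q_to_d(q):
    """ARMv7 scheme: 128-bit register Qn is the adjacent 64-bit pair D(2n), D(2n+1)."""
    if not 0 <= q < 16:
        raise ValueError("ARMv7 exposes 16 virtual 128-bit registers")
    return (2 * q, 2 * q + 1)


def armv8_d_view(v_value):
    """ARMv8 scheme: the 64-bit view of Vn is the low half of the 128-bit register."""
    return v_value & ((1 << 64) - 1)


def capacity_bits(count, width):
    """Total architectural vector state in bits."""
    return count * width
```

Here `capacity_bits(32, 128)` is 4096 bits against `capacity_bits(32, 64)` at 2048 bits, which is the doubling described above.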
ARMv8 Instruction Set Changes

As is to be expected, the most substantive changes in the A64 ISA are the memory model and related instructions. The A64 instruction set that operates on AArch64 is largely similar to the existing ISA, but without various idiosyncrasies that are problematic for modern microprocessors. As with A32, instructions are fixed length, requiring 32 bits to specify as many as 3 operands.
Unlike ARMv7 though, there is currently no support for a 64-bit version of Thumb to improve instruction density. One challenge to a potential T64 instruction set is that larger addresses and branch offsets will crowd the instruction format. A64 already decreased the offset range for conditional branches to +/-1MB (from +/-32MB in ARMv7), and further reductions would be rather painful.
Nearly all instructions in A32 were conditional, using predication. However, predication uses bits in the instruction encoding that are already at a premium due to doubling the number of registers in AArch64. Moreover, predication complicates out-of-order execution since it adds an extra input that must be renamed. In A64, the only conditional instructions are branch, comparison and select. While this will slightly increase code size, the simplification is worthwhile.
Similarly, A32 had an in-line shifter that could be used for ‘free’ with nearly every integer instruction. However, shifts are notoriously difficult to implement at high frequency due to the complicated wiring. An implicit shift in every instruction will increase the length of pipeline stages and reduce the overall frequency. A64 instructions can apply only a very limited shift to the second source operand, and there are new instructions to handle more complicated cases such as variable shifts.
Most of the changes to VFP in A64 are relatively minor. VFP is intended for single precision and double precision scalar computation. There are new instructions to satisfy the IEEE754-2008 standard, particularly calculating the min and max of two numbers. Floating point comparisons now set the integer condition flags, rather than the flags in the FP Status Register. There are also new conversion instructions between various FP formats, and also the new 64-bit integers.
The advanced SIMD is the vector counterpart to VFP and has been more aggressively enhanced. In particular, the vector instructions in ARMv7 could operate on integer, single precision and rarely polynomial data. In A64, the vector elements also include double precision floating point and have full IEEE support with the required rounding modes, and handling of denormals and NaNs.
Similar to SSE or AVX, the advanced SIMD instructions operate on vectors whose element count depends on the size of the register and the data type. The underlying registers are either 64-bit or 128-bit and can pack from 1 to 16 elements. The integer data types are mostly unchanged, spanning from a single byte to 64 bits. Floating point data can be stored in half precision, but operations are all single precision or double precision.
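The range of 1 to 16 elements follows directly from dividing the register width by the element width. A minimal sketch, with a hypothetical helper name:

```python
REGISTER_BITS = (64, 128)        # the two register widths described above
ELEMENT_BITS = (8, 16, 32, 64)   # element sizes from a single byte to 64 bits


def lanes(register_bits, element_bits):
    """Number of elements that fit in one vector register."""
    if register_bits not in REGISTER_BITS:
        raise ValueError("register must be 64 or 128 bits")
    if element_bits not in ELEMENT_BITS:
        raise ValueError("element must be 8, 16, 32 or 64 bits")
    return register_bits // element_bits
```

The extremes are `lanes(128, 8)`, sixteen bytes in a 128-bit register, and `lanes(64, 64)`, a single 64-bit element in a 64-bit register.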
With the move to 128-bit registers, A64 includes new instructions for inserting and extracting vector elements. There are also three cross-lane instructions for vector reductions, specifically summing and taking the minimum or maximum value.
Several existing instructions including comparison, add, absolute value and negation have been extended to operate on 64-bit integer elements. There are also new instructions for data type conversion, floating point normalization and saturating integer arithmetic. Lastly, ARMv8 includes a variety of optional cryptographic instructions that are intended to complement existing hardware accelerators. ARM opted to focus on AES encryption, the SHA1 and SHA256 hashing algorithms and Galois fields with 16 new instructions.
Virtual Address Space

The most prominent and visible aspect of ARMv8 and A64 is extending the virtual addressing, but there are many other improvements to the memory architecture. Currently, AArch64 features two 48-bit virtual address spaces, one for the kernel and one for applications. Application addressing starts at 0 and grows upwards, while kernel space grows down from 2^64; any references to unmapped addresses in between will trigger a fault. Pointers are sign extended to 64 bits, and can optionally be configured to use the upper 8 bits for tagging pointers with additional information.
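The split into two 48-bit ranges can be sketched as a simple classifier. The constant and function names are hypothetical, and the sketch ignores the optional pointer tagging mentioned above.

```python
VA_BITS = 48
USER_TOP = (1 << VA_BITS) - 1                 # highest application address
KERNEL_BASE = (1 << 64) - (1 << VA_BITS)      # lowest kernel address


def classify(addr):
    """Classify a 64-bit virtual address as user, kernel, or faulting."""
    if not 0 <= addr < (1 << 64):
        raise ValueError("address must fit in 64 bits")
    if addr <= USER_TOP:
        return "user"      # upper 16 bits are all zeros
    if addr >= KERNEL_BASE:
        return "kernel"    # upper 16 bits are all ones
    return "fault"         # the unmapped hole between the two ranges
```

In this model a valid address is exactly one whose upper 16 bits are a sign extension of bit 47, which is why pointers are described as sign extended.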
The translation tables for each virtual address space are mapped using either traditional 4KB pages or a new larger 64KB page. The minimum page size determines which page table format will be used. The 64KB pages can improve performance, at the cost of substantially increasing memory fragmentation and utilization.
For virtual address spaces with 4KB pages, a 4-level table is used with 9 bits translated per lookup. In this case, 64KB pages will help index more data, but not reduce the number of lookups. When using 64KB pages, each lookup provides 13 address bits and only 3 levels are necessary. In fact, if addresses smaller than 42 bits are used, the 64KB page tables only need two lookups for translation.
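The level counts follow from simple arithmetic: subtract the page offset bits from the virtual address width, then divide by the bits resolved per lookup and round up. A small sketch, with a hypothetical function name:

```python
def table_levels(va_bits, page_bytes, bits_per_level):
    """Number of table lookups needed to translate a va_bits-wide address."""
    offset_bits = page_bytes.bit_length() - 1      # log2 of a power-of-two page size
    remaining = va_bits - offset_bits              # bits the tables must resolve
    return -(-remaining // bits_per_level)         # ceiling division
```

With 4KB pages, 48 - 12 = 36 bits at 9 bits per level gives 4 levels. With 64KB pages, 48 - 16 = 32 bits at 13 bits per level rounds up to 3 levels, and a 42-bit address leaves 26 bits, which is exactly 2 levels.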
Addressing and Memory Instructions

Like all RISCs, ARM is a strict load/store instruction set that separates memory accesses from arithmetic. ARMv7 and A32 have a single relatively nice indexed addressing mode with optional pre- and post-incrementing. A base register is added to a scaled (i.e. shifted) offset (either another register or an immediate). Optionally, the offset pre- or post-updates the base register, which is useful for handling loops. Since the PC is an ordinary register in AArch32, the indexed addressing mode can be used for PC-relative addressing as well. Since ARMv6, unaligned memory accesses have been supported for single loads or stores.
A64 addressing is generally similar in terms of capabilities, but has been adapted to simplify address calculation. There are two separate addressing modes, the familiar indexed mode and a new PC-relative mode, since the PC cannot be accessed like a regular register. As with ARMv7, unaligned accesses are allowed, but have a performance penalty.
The indexed addressing is still robust, but the incrementing is limited to simplify the critical path in address generation. The 64-bit base register is added to a scaled offset. The offset can be an immediate, a 64-bit register or a sign-extended 32-bit register. However, pre-incrementing is only available with unscaled immediate offsets. Any load or store can post-increment with an unscaled immediate, but only SIMD loads and stores can use post-increment with a register offset. The immediates are generally limited to 9-bits signed. However, for base plus offset a scaled 12-bit unsigned immediate is available.
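The difference between plain offset, pre-increment and post-increment addressing can be sketched as follows. The function names and mode strings are hypothetical; the immediate ranges are derived from the 9-bit signed and scaled 12-bit unsigned fields described above.

```python
MASK64 = (1 << 64) - 1


def access(base, offset, mode):
    """Return (address used for the access, base register value afterwards)."""
    if mode == "offset":     # base plus offset, base register unchanged
        return (base + offset) & MASK64, base
    if mode == "pre":        # update the base first, then access at the new base
        new_base = (base + offset) & MASK64
        return new_base, new_base
    if mode == "post":       # access at the old base, then update it
        return base, (base + offset) & MASK64
    raise ValueError(f"unknown mode: {mode}")


def imm9_ok(offset):
    """Unscaled 9-bit signed immediate: -256 to 255 bytes."""
    return -256 <= offset <= 255


def scaled_imm12_ok(offset, size):
    """12-bit unsigned immediate scaled by the access size in bytes."""
    return offset % size == 0 and 0 <= offset // size <= 4095
```

For an 8-byte access the scaled form reaches up to 4095 * 8 = 32760 bytes above the base, but only in multiples of 8 and never below it, while the unscaled form covers any byte offset in a much narrower window.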
A new literal addressing mode is used to calculate PC-relative addresses when accessing at least 32-bits of data. Literal addressing replaces the base register with the PC and adds a 19-bit signed offset, giving a relatively limited range of +/-1MB. While this preserves some PC-relative capabilities, it is significantly less flexible than addressing in AArch32.
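The +/-1MB figure can be reproduced with a short calculation. A 19-bit signed field on its own spans only about +/-256K, so the sketch assumes the offset is counted in 4-byte words; that scale factor and the function names are assumptions for illustration.

```python
def signed_range(bits):
    """Smallest and largest values of a two's complement field."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def literal_reach(bits=19, scale=4):
    """Byte range reachable from the PC with a scaled signed offset.

    scale=4 assumes the offset is counted in 4-byte words.
    """
    lo, hi = signed_range(bits)
    return lo * scale, hi * scale
```

Under these assumptions the reach is -1048576 to +1048572 bytes, i.e. 1MB backwards and just under 1MB forwards.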
ARMv7 included several instructions that could access multiple memory locations. In particular, load multiple and pop can read all of the registers from memory, while store multiple and push can write all the registers to memory. These instructions must be micro-coded to handle mis-speculation, exceptions and interrupts and are a potential source of complexity. AArch64 eliminates all multiple memory access instructions to simplify the microarchitecture at the cost of instruction density. To accelerate multiple accesses, two new instructions, load pair and store pair, have been added.
Load and store pair access a pair of independent registers from adjacent memory locations with unaligned support. However, the addressing modes are somewhat more limited than normal accesses. Specifically, the pair access instructions can only use the base register plus a scaled 7-bit signed immediate, with optional pre- and post-increment. The pair instructions are clever, since only a single address calculation is needed, saving a little power. The pair instructions are also available as non-temporal accesses, although with only base plus immediate addressing.
Memory Ordering Model

As part of defining ARMv8, the architects paid careful attention to defining a clean memory model. This is particularly crucial for an architecture which will have many different teams working on implementations, since memory ordering is responsible for the most complex and difficult bugs in both hardware and software.
ARMv8 has a Release Consistency memory model, which is relatively weak. It is very similar to the Itanium memory model, and aligns well with C++11. This choice was motivated primarily by power efficiency. Generally, weak ordering models are more difficult to program, because there are few guarantees. However, weak ordering can also reduce the buffering that is required for in-flight loads and stores in a multi-processor system and reduce power consumption.
In the ARMv8 memory model, an aligned memory access that targets a single GPR is guaranteed to be atomic. Load pair and store pair instructions are guaranteed to appear as two individual atomic accesses, if targeting GPRs and naturally aligned. Unaligned accesses are not atomic, and as a practical matter are likely to be split into at least two accesses and a shift. Moreover, vector memory accesses (whether SIMD or scalar FP) are not guaranteed to be atomic at all. To allow programmers to write concurrent software, a number of synchronization primitives are available.
ARMv7 and v8 feature three different types of barriers: a Data Synchronization Barrier (DSB), a Data Memory Barrier (DMB), and an Instruction Synchronization Barrier (ISB). A DSB stalls the processor until all pending loads and stores have completed. A DMB forces all earlier (in program order) memory accesses to become globally visible before any subsequent accesses. An ISB flushes the CPU pipeline and any prefetch buffers, forcing any subsequent instructions to be fetched from cache or memory. Since ARM does not have coherent instruction caching, this is necessary (but not sufficient) for modifying instructions in memory.
ARMv7 and v8 also incorporate exclusive (or atomic) memory accesses, which are sometimes described as a load-linked and store-conditional (LL/SC). The load-linked instruction will read a value from an address in memory, and then the store-conditional will write a new value to the same address in memory if no other writes to the address have occurred. LL/SC is quite useful for constructing other synchronization primitives such as spinlocks. The LL/SC can be combined with pair instructions to atomically update a location that spans two registers.
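The LL/SC pattern is easiest to see as a retry loop. The following is a sequential toy model rather than real concurrent code: the class, its version counter standing in for the hardware's exclusive monitor, and the function names are all hypothetical.

```python
class ExclusiveMemory:
    """Toy single-word memory that tracks writes with a version counter."""

    def __init__(self, value=0):
        self.value = value
        self._version = 0            # bumped on every successful write

    def load_linked(self):
        # Return the value together with a token recording what was observed.
        return self.value, self._version

    def store(self, new_value):
        # An ordinary store always succeeds and invalidates older tokens.
        self.value = new_value
        self._version += 1

    def store_conditional(self, token, new_value):
        # Succeeds only if no write has occurred since the matching load_linked.
        if token != self._version:
            return False
        self.value = new_value
        self._version += 1
        return True


def atomic_add(mem, delta):
    """Read-modify-write built from LL/SC; retries until the store succeeds."""
    while True:
        old, token = mem.load_linked()
        if mem.store_conditional(token, old + delta):
            return old
```

A spinlock follows the same shape: load-link the lock word, retry while it is held, and store-conditional the "held" value, looping again if the conditional store fails.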
ARMv8 introduces the new and elegant one-sided fences associated with Release Consistency: load-acquire and store-release. Unlike the barriers in ARMv7, these fences are address-based synchronization primitives. A load-acquire guarantees that any later (in program order) memory accesses will only be visible after the load-acquire. A store-release guarantees that all earlier memory accesses will be visible before the store-release becomes visible. Moreover, the store-release becomes visible to all caching agents in the system simultaneously. The two can also be combined to form a full fence: a store-release and a load-acquire will be globally visible in program order.
The address-based synchronization primitives, load-acquire/store-release and LL/SC are all limited to only use base register addressing, with no offsets, indexing or increments, which simplifies the implementation.
Conclusion

The ARMv8 architecture is classically British: a clean and elegant 64-bit instruction set, with backwards compatibility for existing 32-bit software. The new AArch64 is certainly an improvement over ARMv7, with many enhancements above and beyond simply extending the virtual address space to 48 bits.
The most notable additions in ARMv8 are the larger and highly regular integer register file, double precision vectors with IEEE support, and new synchronization primitives with a well-defined memory ordering model. In some respects though, the more significant changes came not from adding features, but removing them.
Like x86, ARMv7 had a fair bit of cruft, and the architects took care to remove many of the byzantine aspects of the instruction set that were difficult to implement. The peculiar interrupt modes and banked registers are mostly gone. Predication and implicit shift operations have been dramatically curtailed. The load/store multiple instructions have also been eliminated, replaced with load/store pair. Collectively, these changes make AArch64 potentially more efficient than ARMv7 and easier to implement in modern process technology.
There are no ARMv8 implementations available to judge the merits of the architecture in practice. But overall, ARMv8 is clearly a sound design that was well thought out and should enable reasonable implementations.
The vast majority of companies will wait for a licensable core design from ARM. However, those with the resources and expertise to design a CPU core will forge ahead and should have a time to market advantage and a potential differentiating factor. Applied Micro should be first to market, but others will swiftly follow, including Cavium Networks, Qualcomm, Samsung, and Nvidia.
Certainly, the next few years should prove very interesting. The number of ARMv8 architecture licensees looks set to grow, which should inject some additional diversity into the industry. However, it is unclear whether the market is large enough to support so many companies in the long term. Future ARMv8 cores will undoubtedly be found in Apple’s iPhone and iPad, along with Android devices from TI, Samsung, and others. The real question is whether ARMv8 will enable ARM’s partners to move up the value chain to servers and notebooks. However, that requires competing with Intel, which has a massive advantage in process technology over the rest of the industry.

An epic piece by the great David Kanter. If enough people are interested, I will consider translating it.
http://www.realworldtech.com/arm64/
