We propose an instruction set compiled simulation (IS-CS)technique where the program is compiled prior to run time and executed interpretively as shown in Figure 3.The simulator recognizes if the program code of a previously ex-ecuted address has changed and initiates a re-decoding.We achieve both the performance of compiled simulation and ?exibility of interpretive simulation.The simulation perfor-mance of the IS-CS technique is upto 40%better than the best known result [1]in this category.There are two rea-sons for its superior performance.First,the time consum-ing instruction decoding process is moved to compile time while maintaining the ?exibility of interpretive simulation.Second,we use a novel instruction abstraction technique to generate aggressively optimized decoded instructions that further improve simulation performance.
3.INSTRUCTION SET COMPILED SIMU-LATION
We developed the instruction set compiled simulation (IS-CS)technique with the intention of combining the full ?ex-ibility of interpretive simulation with the speed of the com-piled principle.The basic idea is to move the time-consuming instruction decoding to compile time as shown in Figure 3.The application program,written in C/C++,is compiled using the gcc compiler con?gured to generate binary for the target machine.The instruction decoder decodes one bi-nary instruction at a time to generate the decoded program for the input application.The decoded program is com-piled by C++compiler and linked with the simulation li-brary to generate the simulator.The simulator recognizes if the previously decoded instruction has changed and initiates re-decoding of the modi?ed instruction.If any instruction
Figure3:Instruction Set Compiled Simulation Flow is modi?ed during execution and subsequently re-decoded, the location in instruction memory is updated with the re-decoded instruction.To improve the simulation speed we use a novel instruction abstraction technique that generates optimized decoded instructions as described in Section3.1. As a result the computation during run-time is minimized. This technique achieves the speed of compiled simulation due to compile-time decoding of application as described in Section3.2.Section3.3describes the simulation engine that o?ers the full?exibility of interpretive simulation.
3.1Instruction Abstraction
In traditional interpretive simulation(e.g.,Simplescalar [3])the decoding and execution of binary instructions are done using a single monolithic function.This function has many if-then-else and switch/case statements that perform certain activities based on bit patterns of opcode,operands, addressing modes etc.In advanced interpretive simulation (e.g.,LISA[1])the binary instruction is decoded and the decoded instruction contains pointers to speci?c functions. There are many variations of these two methods based on e?ciency of decode,complexity of implementation,and per-formance of execution.However,none of these techniques exploit the fact that a certain class of instructions may have a constant value for a particular?eld of the instruction.For example,a majority of the ARM instructions execute uncon-ditionally(condition?eld has value always).It is a waste of time to check the condition for such instructions every time they are executed.
Clearly,when certain input values are known for a class of instructions,the partial evaluation[13]technique can be applied.The e?ect of partial evaluation is to specialize a program with part of its input to get a faster version of the same program.To take advantage of such situations we need to have separate functions for each and every possible for-mat of instructions so that the function could be optimized by the compiler at compile time and produce the best per-formance at run time.Unfortunately,this is not feasible in practice.For example,consider the ARM data process-ing instructions.It can have16conditions,16operations, an update?ag(true/false),and one destination followed by two source operands.The second source operand,called shifter operand,has three?elds:operand type(reg/imm), shift options(5types)and shift value(reg/imm).In total, the ARM data processing instructions have16x16x2x2 x5x2=10240possible formats.
Our solution to this problem is to de?ne instruction classes, where each class contains instructions with similar formats. Most of the time this information is readily available from the instruction set architecture manual.For example,we de-?ned six instruction classes for the ARM processor viz.,Data Processing,Branch,LoadStore,Multiply,Multiple Load-Store,Software Interrupt,and Swap.Next,we de?ne a set of masks for each instruction class.The mask consists of’0’,’1’and’x’symbols.A’0’(’1’)symbol in the mask matches with a’0’(’1’)in binary pattern of the instruction at the same bit position.An’x’symbol matches with both’0’and ’1’.For example,the masks for the data processing instruc-tions are shown below:
"xxxx-001x xxxx-xxxx xxxx-xxxx xxxx-xxxx"
"xxxx-000x xxxx-xxxx xxxx-xxxx xxx0-xxxx"
"xxxx-000x xxxx-xxxx xxxx-xxxx0xx1-xxxx"
We use C++templates to implement the functionality for each class of instructions.For example,the pseudo code for the data processing template is shown below.The template has four parameters viz.,condition,operation,update?ag, and shifter operand.The shifter operand is a template hav-ing three parameters viz.,operand type,shift options and shift value.
Example1:Template for Data Processing Instructions
template
{
SftOper_sftOperand;
Reg_dest,_src1;
public:
.......
virtual void execute()
{
if(Cond::execute())
{
_dest=Op::execute(_src1,_sftOperand.getValue());
if(Flag::execute())
{
//Update Flags
.......
}
}
}
};
We also use a Mask Table for the mapping between mask patterns and templates.It also maintains a mapping be-tween mask patterns and functions corresponding to those templates.
This instruction abstraction technique is used to generate aggressively optimized decoded instructions as described in Section3.2.
3.2Instruction Decoder
Algorithm1decodes one binary instruction at a time to generate the decoded program for the input application. For each instruction in the application program it selects the appropriate template using Algorithm2.It generates a customized template for the instruction using the appro-priate parameter values.Algorithm3brie?y describes the
customized template generation process.Finally,the cus-tomized template is instantiated and appended in the tem-porary program TempProgram.The TempProgram is fed to a C++compiler that performs necessary optimizations to take advantage of the partial evaluation technique,de-scribed in Section3.1,to produce the DecodedProgram.The DecodedProgram is loaded into instruction memory which is a separate data structure than main memory.While the main memory holds the original program data and instruc-tion binaries,each cell of instruction memory holds a pointer to the optimized functionality as well as the instruction bi-nary.The instruction binary is used to check the validity of the decoded instruction during run-time.
Algorithm3describes the template customization pro-cess.The algorithm’s basic idea is to extract the values from speci?c?elds of the binary instruction(e.g.,opcode, operand etc.)and assign those values to the template.We maintain the information for each class of instructions,tem-plates,?eld formats,and mask patterns.These information can be derived from the processor speci?cation described us-ing an Architecture Description Language such as LISA[1], EXPRESSION[7]and nML[12].
Algorithm1:Instruction Decoding
Inputs:Application Program Appl(Binary),MaskTable maskTable. Output:Decoded Program DecodedProgram.
Begin
TempProgram={}
foreach binary instruction inst with address addr in Appl
template=DetermineTemplate(inst,maskTable)
百度搜索“70edu”或“70教育网”即可找到本站免费阅读全部范文。收藏本站方便下次阅读,70教育网,提供经典知识文库Instruction set compiled simulation A technique for fast and(2)在线全文阅读。
相关推荐: