Multicore Processor Architecture

Anuradha Lohar
Jun 11, 2021


Introduction

A multicore processor provides parallel processing with reduced, sustainable computation time. In the current multicore landscape, two main challenges must be addressed: meeting the growing demand for high-performance processors, and reducing the processor's battery power draw so that such systems can remain mobile. These two challenges are at odds: addressing one tends to worsen the other, and vice versa.

Multicore Processor

Internal Structure

A multicore processor can be defined as a system whose central processing unit is divided into multiple logical parts called cores, where each core may have one or more private caches, as shown in figure 1. The figure shows four cores, each with one or more caches, connected via an interconnection network to the system's memory.

Core with cache levels

The inside view of a single core is shown in the figure: a processor with a single core consisting of a private L1 cache and a shared L2 cache. The number of cache levels is determined by how far away the main memory is, that is, how many cycles it takes to access it. There are two types of multicore processors: homogeneous and heterogeneous. In a homogeneous multicore processor all cores are identical, whereas a heterogeneous one contains non-identical cores.

Architecture of any multicore processor

On the basis of common features or characteristics, multicore processor architectures can be broadly classified along five dimensions: application class, power/performance, processing elements, memory system, and accelerators/integrated peripherals.

CORES OPERATED ON DVS SCHEMES

i) Application class: Here the multicore architecture is focused mainly on the requirements of a specific application domain. This yields several benefits, but a multicore built for one application domain often cannot be used in, or performs poorly in, other domains. In its execution phase an application falls into one of two classes: data-processing dominated or control-processing dominated. In data-dominated applications, operations are performed on a set of data with little or no data reuse, which allows the processing to be done in parallel with high throughput and performance on large data sets; examples include image processing, audio processing, and wireless baseband processing. Control-processing-dominated applications involve a high degree of conditional branching and parallelism with a large amount of data reuse; examples include data compression and decompression, network processing, and query processing.
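The contrast between the two classes can be sketched in a few lines of Python. Both functions and their inputs are illustrative, not taken from a real workload:

```python
def scale_pixels(pixels, gain):
    """Data-processing dominated: the same operation is applied
    independently to every element, with no data reuse, so the
    iterations could run in parallel across cores."""
    return [min(255, int(p * gain)) for p in pixels]

def run_length_encode(data):
    """Control-processing dominated: heavy conditional branching,
    with reuse of recently seen data (the current run)."""
    runs = []
    for value in data:
        if runs and runs[-1][0] == value:   # branch depends on prior data
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(v, n) for v, n in runs]

print(scale_pixels([10, 100, 200], 1.5))   # [15, 150, 255]
print(run_length_encode("aaabbc"))         # [('a', 3), ('b', 2), ('c', 1)]
```

The first function is trivially parallel; the second carries a dependency from one iteration to the next, which is exactly what makes control-dominated code harder to spread across cores.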

ii) Power/performance: Some applications have demanding power/performance requirements. For example, a game on a console provides a real-time environment, a feel that can only be delivered if the application runs on a multicore architecture designed with performance as a first-class constraint. But when such games come to mobile phones they also need good battery life, so power becomes a constraint as well. For such applications, the processor architecture should consume little power while delivering high performance.
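The tension between the two constraints follows from the standard CMOS dynamic power formula, P = C·V²·f. A quick sketch (the capacitance, voltage, and frequency numbers are made up for illustration, not from a real chip):

```python
def dynamic_power(c_eff, voltage, freq_hz):
    """Dynamic power of a CMOS core: P = C_eff * V^2 * f.
    c_eff is the effective switched capacitance in farads."""
    return c_eff * voltage ** 2 * freq_hz

# Lowering frequency lets voltage drop too, so power falls
# much faster than performance (the basis of DVS/DVFS schemes).
full = dynamic_power(1e-9, 1.2, 2.0e9)   # 2 GHz at 1.2 V -> ~2.88 W
slow = dynamic_power(1e-9, 0.9, 1.0e9)   # 1 GHz at 0.9 V -> ~0.81 W
print(full / slow)                        # ~3.6x less power for 2x slowdown
```

This quadratic dependence on voltage is why the same core design can serve both a console (run fast and hot) and a phone (run slower and cooler).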

iii) Processing elements: The multicore architecture also depends on the type of instruction set architecture, which may be a Reduced Instruction Set Computer (RISC) or a Complex Instruction Set Computer (CISC). The processing element used in each core defines the architecture as well. On this basis there are two types of cores: in-order and out-of-order. In-order cores have a small die area and low power consumption, and work well with large applications that have a high degree of parallelism and less-sensitive serial sections. Out-of-order cores need more die area and are not suitable for power-efficient systems; however, they perform well for applications with varied behaviors where high performance and throughput are needed. To increase the performance of a multicore processor, it is better to use Single-Instruction Multiple-Data (SIMD) or Very Long Instruction Word (VLIW) architectures.
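The SIMD idea can be mimicked in Python purely as an illustration (real SIMD happens in hardware; Python itself gains nothing from this restructuring):

```python
def add_scalar(a, b):
    """Scalar loop: conceptually one addition per instruction."""
    return [x + y for x, y in zip(a, b)]

def add_simd4(a, b):
    """SIMD-style sketch: step through the data in 4-element lanes,
    mimicking a single instruction operating on multiple data items."""
    out = []
    for i in range(0, len(a), 4):
        lane_a, lane_b = a[i:i + 4], b[i:i + 4]   # one 4-wide "vector op"
        out.extend(x + y for x, y in zip(lane_a, lane_b))
    return out

print(add_simd4([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # [11, 22, 33, 44, 55]
```

In hardware, each 4-wide step is a single instruction, which is why data-dominated workloads map so well onto SIMD units.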

iv) Memory systems: From the memory-system point of view, the architecture includes the caches and their levels, the consistency model, cache coherence support, and the intra-chip interconnect. These determine how the cores communicate, providing efficient programmability and parallelism. The consistency model defines the order in which memory operations become visible. Strong consistency models have strict ordering constraints and are harder to design for, whereas weak models are less complex and make the memory system easier to design. The cache configuration deals with the amount, number, and levels of caches the system requires. The amount of cache an architecture needs depends on the application: the more data is reused, the larger the cache should be. Larger caches give better performance, but they also affect the die area and power budget. The number of cache levels depends on how far the main memory is from each processing element. The intra-chip interconnect handles general communication among processing elements and cache coherence; interconnects for inter-core communication include the bus, ring, crossbar, and network-on-chip. The type of programming paradigm the architecture supports depends on cache coherence, which keeps the data seen by all processing cores consistent.

v) Accelerators/integrated peripherals: A multicore processor can also be designed around accelerators or integrated peripherals, that is, highly specialized processing units for functions that cannot be implemented efficiently in software.

B. Issues in designing a multicore processor

The following issues must be considered for a good multicore processor design.

  1. Cache related issues:
Cache related issues in multicore processor architecture

i) Amount of cache: The size of cache required for a multicore processor is application dependent. For applications with high data reusability, a large cache is preferred. The larger the cache, the fewer the misses and the better the performance, but the higher the cost.

ii) Number of cache levels: Another cache issue is deciding how many cache levels a multicore's cache hierarchy should have. Not every core's cache needs the same number of levels. The number of cache levels is determined by how far away the main memory is, that is, how many cycles it takes to access it: the more cycles, the more cache levels are warranted, and the faster the average access.
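The trade-off can be made concrete with an average memory access time (AMAT) calculation. A minimal sketch, where the latencies and hit rates are illustrative only (this model charges each level's latency to every access that reaches it):

```python
def amat(levels, mem_latency):
    """Average memory access time for a cache hierarchy.
    levels: list of (hit_latency_cycles, hit_rate) from L1 outward.
    A miss at one level falls through to the next; the final miss
    goes all the way to main memory."""
    total, reach = 0.0, 1.0   # reach = fraction of accesses that get this far
    for latency, hit_rate in levels:
        total += reach * latency      # every access reaching this level pays its latency
        reach *= (1.0 - hit_rate)     # only misses continue outward
    return total + reach * mem_latency

# Two levels: L1 (4 cycles, 90% hits), L2 (12 cycles, 80% of L1 misses hit),
# main memory 200 cycles away -> average of 9.2 cycles per access.
print(amat([(4, 0.9), (12, 0.8)], 200))
```

Without the L2 level, the same trace would average 4 + 0.1·200 = 24 cycles, which is exactly why a distant main memory justifies more cache levels.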

iii) Deterministic/nondeterministic performance: In some applications the caches are tagged and managed either by the hardware itself or by software through local memory. With hardware-managed tags, the tags are stored on the same die, reducing the space available for computation, and accesses show nondeterministic performance. With tags assigned explicitly by software and managed in local memory, no on-die tag storage is needed, leaving more space for storage in the same area. However, the latter approach is more complex, so the choice of tagging scheme depends entirely on the application and its performance needs.

iv) Reusability: Another cache-related issue is how to make cache space reusable. Since each core of the multicore processor has a limited amount of private or shared cache, space must be reused based on how frequently cache entries are accessed. The more the cache is reused, the more space is available for computation and the faster the access.

v) Hit/miss rates: If data requested during computation is found in a core's cache, it is a hit; otherwise it is a miss. The hit/miss rate has always been a concern for an ideal processor: a higher hit rate is preferred so that access time is reduced and computation is fast. This factor depends on cache size, the write and replacement strategy used by the cache, and the number of cache levels.

2) Selection of cores used: An application distributes its tasks among the cores of the processor so that they can be computed efficiently. Two issues must be dealt with here: selecting the type of core compatible with the tasks assigned to it, and deciding the number of cores a particular application needs.

3) Consistency among cores: Since each core has its own private cache, the copies of a datum held in different caches may diverge. Consistency among caches, called cache coherence, ensures that a single image of the data stored in memory is seen by all cores of the processor. Cache coherence can be implemented via broadcast coherence or directory-based coherence. In broadcast coherence, only one processor at a time is allowed to perform a write: when a write occurs, an invalidate message is sent to all other cores, and the write proceeds only once all cores have acknowledged it. All other operations are delayed until the write completes, providing a strongly consistent environment among the cores. In directory-based coherence, a directory records which memory addresses are held in which caches. Each address is assigned a home node that stores its portion of the directory. On a request, the processor queries the address's home node to find the set of cores holding that cache block, then obtains permission from all of those cores. This scheme allows read and write requests to proceed in parallel and is therefore suited to weaker consistency models. Broadcast coherence is practical only for processors with a small number of cores, whereas the directory scheme scales to processors with many cores.
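The bookkeeping side of directory coherence can be sketched with a simple valid/invalid protocol. This is a deliberately minimal model (real protocols such as MESI track more states, and the home-node distribution is omitted here):

```python
class Directory:
    """Tracks, per block address, which cores hold a copy."""

    def __init__(self):
        self.sharers = {}   # block address -> set of core ids

    def read(self, core, addr):
        """A read adds the requesting core to the sharer set."""
        self.sharers.setdefault(addr, set()).add(core)

    def write(self, core, addr):
        """A write invalidates every other sharer before proceeding,
        so all cores keep seeing a single image of the data.
        Returns the set of cores that must drop their copy."""
        invalidated = self.sharers.get(addr, set()) - {core}
        self.sharers[addr] = {core}   # the writer is now the sole holder
        return invalidated

d = Directory()
d.read(0, 0x40); d.read(1, 0x40); d.read(2, 0x40)
print(d.write(1, 0x40))   # cores 0 and 2 must invalidate
```

Because only the sharer set for one block is consulted, unrelated reads and writes can proceed in parallel, which is the scalability advantage over broadcasting every write to every core.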

SCHEDULING OF MULTICORES

The scheduling of a multicore is performed so that real-time tasks do not miss their deadlines. Multicore scheduling comes in two flavors: partitioned scheduling and global scheduling. In partitioned scheduling, tasks are assigned to cores statically and are not allowed to migrate from one core to another. Partitioned scheduling is advantageous because there is no migration overhead, but it suffers two major disadvantages:

(a) The scheme is inflexible and cannot easily accommodate dynamic tasks without a complete re-partition.

(b) The optimal assignment of tasks to cores is an NP-hard problem, for which polynomial-time heuristics yield sub-optimal partitions.
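Point (b) can be illustrated with one such polynomial-time heuristic: first-fit decreasing, which is bin packing applied to task utilizations. The utilizations and core count below are assumptions for the sketch:

```python
def partition_first_fit(utilizations, n_cores, capacity=1.0):
    """First-fit-decreasing heuristic for the NP-hard task-to-core
    partitioning problem: sort tasks by utilization and place each
    on the first core whose remaining capacity fits it.
    A capacity of 1.0 means a fully loaded core."""
    cores = [[] for _ in range(n_cores)]
    load = [0.0] * n_cores
    for u in sorted(utilizations, reverse=True):   # biggest tasks first
        for i in range(n_cores):
            if load[i] + u <= capacity + 1e-9:
                cores[i].append(u)
                load[i] += u
                break
        else:
            raise ValueError(f"task with utilization {u} does not fit")
    return cores

print(partition_first_fit([0.6, 0.5, 0.4, 0.3, 0.2], n_cores=2))
```

Here the heuristic happens to pack both cores fully, but on adversarial inputs it can fail where an exhaustive (exponential-time) search would succeed, which is the sub-optimality the text refers to.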

FACTORS AFFECTING PERFORMANCE OF MULTICORE PROCESSOR

The performance of a multicore processor is good if the CPU is utilized properly and the desired outcome is produced with a good response time. The main factors that may affect the performance of a multicore processor are:

1) Core configuration: A multicore processor may have two or more cores running at the same or different speeds. The performance of any multicore processor suffers if the cores are not fully utilized, that is, if a core is idle most of the time. A mechanism should be applied to reduce the number of idle slots in a core, improving overall utilization of the system; and when there are many idle slots, adjacent slots can be combined into one larger slot during which the core can be put into a sleep state.
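The slot-combining step described above amounts to merging adjacent idle intervals on a core's timeline. A minimal sketch, with intervals as (start, end) tuples in arbitrary time units (the sample timeline is invented):

```python
def merge_idle_slots(slots):
    """Merge touching or overlapping idle intervals into larger
    slots, so the core can enter a deeper sleep state during them."""
    merged = []
    for start, end in sorted(slots):
        if merged and start <= merged[-1][1]:   # touches the previous slot
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(merge_idle_slots([(0, 2), (2, 5), (8, 9)]))   # [(0, 5), (8, 9)]
```

One idle stretch of 5 units allows a deeper (and more worthwhile) sleep state than two short stretches, since entering and leaving sleep itself costs time and energy.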

2) Workload: The workload of a multicore processor defines the number and kind of tasks, normal or real-time, to be assigned to the cores of the system. How the workload scales affects the performance of the multicore.

3) Inter-/intra-core communication: In inter-core communication, two or more cores communicate via a shared-memory architecture, whereas in intra-core communication a core consults its local cache for any data it needs. The more communication stays intra-core, the better the system's performance, provided the hit ratio is sufficiently higher than the miss ratio.
