Symmetric Multiprocessing

Symmetric Multiprocessing (or SMP) is one method of having multiple processors in one computer system. In an SMP system all logical cores are able to see the entire memory of the system, whereas in a NUMA system the cost of a memory access depends on which processor the memory is attached to. SMP and NUMA are not mutually exclusive, however: as Brendan has pointed out on the forums, Intel's Core i7 implements both SMP and NUMA, as well as hyper-threading.

Initialisation of an old SMP system

The startup sequence differs between CPUs. Intel's system programmer's manual (section 7.5.4) contains the initialization protocol for Intel Xeon processors, but doesn't cover older CPUs. For the generic "all CPU types" algorithm, see Intel's MultiProcessor Specification.

For 80486 (with an external 82489DX local APIC), you must use an INIT IPI followed by an "INIT level de-assert" IPI without any SIPIs. This means you can't tell them where to start executing (the vector part of a SIPI) and they always start executing BIOS code. In this case you set the BIOS's CMOS reset value to "warm start with far jump" (i.e. set CMOS location 0x0F to the value 10) so that the BIOS will do a "jmp far [0:0x0467]", and then put the offset and segment of your AP entry point at 0x0467 (offset word at 0x0467, segment word at 0x0469).

The "INIT level de-assert" IPI isn't supported on newer CPUs (Pentium 4 and Intel Xeon), and AFAIK it is ignored completely on these CPUs.

For newer CPUs (P6, Pentium 4) one SIPI is enough, but I'm not sure if older Intel CPUs (Pentium) or CPUs from other manufacturers need a second SIPI or not. It's also possible that the second SIPI is there in case there's a delivery failure for the first SIPI (bus noise, etc).

I normally send the first SIPI and then wait to see if the AP CPU increases a "number of started CPUs" counter. If it doesn't increase this counter within a few milliseconds, then I send the second SIPI. This differs from Intel's generic algorithm (which has a 200 microsecond delay between SIPIs), but finding a time source capable of accurately measuring a 200 microsecond delay during early boot isn't so easy. I've also found that on real hardware, if the delay between SIPIs is too long (and you don't use my method), an AP CPU can run the OS's early AP startup code twice (which in my case would lead to the OS thinking there are twice as many AP CPUs as there are).

You can broadcast these signals across the bus to start every processor that is present. However, by doing so you might also enable processors that were disabled on purpose (because they were defective).

Finding information using MP Table

You may want to use newer ACPI instead of MP Table. If so, please see the next section.

Some information dedicated to multiprocessing is available (although it may not be present on newer machines). First one must find the MP Floating Pointer Structure. It is aligned on a 16 byte boundary and starts with the signature "_MP_" (0x5F504D5F when read as a little-endian 32-bit value). The OS must search in the EBDA, the BIOS ROM space (0xF0000 to 0xFFFFF), and the last kilobyte of "base memory"; the size of base memory is specified in a 2 byte value at 0x413 in kilobytes, minus 1K. Here is what the structure looks like:

struct mp_floating_pointer_structure {
    char signature[4]; // "_MP_"
    uint32_t configuration_table; // Physical address of the configuration table
    uint8_t length; // In 16 bytes (e.g. 1 = 16 bytes, 2 = 32 bytes)
    uint8_t mp_specification_revision;
    uint8_t checksum; // Chosen so that all bytes of the structure add up to 0 (modulo 256)
    uint8_t default_configuration; // If this is not zero then configuration_table should be 
                                   // ignored and a default configuration should be loaded instead
    uint32_t features; // If bit 7 is set then the IMCR is present and PIC mode is being used, otherwise 
                       // virtual wire mode is; all other bits are reserved
};
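
The search described above can be sketched as a scan over a memory range. This is a minimal sketch that runs in user space on a buffer; the names mp_byte_sum and mp_find_floating_pointer are hypothetical, and a kernel would call the scan once for each of the three regions (and may need its own memcmp):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum of all bytes modulo 256; a valid MP structure sums to 0. */
uint8_t mp_byte_sum(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + p[i]);
    return sum;
}

/*
 * Scan [base, base + size) in 16-byte steps for a floating pointer
 * structure with a matching signature and a valid checksum.
 * Assumption: 'base' corresponds to a 16-byte aligned physical address.
 * Returns a pointer to the structure, or NULL if none is found.
 */
const uint8_t *mp_find_floating_pointer(const uint8_t *base, size_t size)
{
    for (size_t off = 0; off + 16 <= size; off += 16) {
        const uint8_t *p = base + off;
        if (memcmp(p, "_MP_", 4) != 0)
            continue;
        size_t len = (size_t)p[8] * 16;   /* length field, in 16-byte units */
        if (len == 0 || off + len > size)
            continue;
        if (mp_byte_sum(p, len) == 0)
            return p;
    }
    return NULL;
}
```

Checking the checksum as well as the signature matters because the four signature bytes can appear by chance in ROM or in leftover data.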

Here is what the configuration table, pointed to by the floating pointer structure, looks like:

struct mp_configuration_table {
    char signature[4]; // "PCMP"
    uint16_t length; // Length of the base table in bytes, including this header
    uint8_t mp_specification_revision;
    uint8_t checksum; // Again, chosen so that all bytes in the table add up to 0 (modulo 256)
    char oem_id[8];
    char product_id[12];
    uint32_t oem_table;
    uint16_t oem_table_size;
    uint16_t entry_count; // This value represents how many entries are following this table
    uint32_t lapic_address; // This is the memory mapped address of the local APICs 
    uint16_t extended_table_length;
    uint8_t extended_table_checksum;
    uint8_t reserved;
};

After the configuration table, there are entry_count entries describing more information about the system, and after those there is an extended table. The first byte of each entry is its type. An entry is either 20 bytes long if it represents a processor (type 0), or 8 bytes long for anything else. Here is what the processor and IO APIC entries look like.

 
struct entry_processor {
    uint8_t type; // Always 0
    uint8_t local_apic_id;
    uint8_t local_apic_version;
    uint8_t flags; // If bit 0 is clear then the processor must be ignored
                   // If bit 1 is set then the processor is the bootstrap processor
    uint32_t signature;
    uint32_t feature_flags;
    uint64_t reserved;
};

Here is an IO APIC entry.

struct entry_io_apic {
    uint8_t type; // Always 2
    uint8_t id;
    uint8_t version;
    uint8_t flags; // If bit 0 is clear then the IO APIC is unusable and the entry should be ignored
    uint32_t address; // The memory mapped address of the IO APIC
};
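
Walking the entries that follow the configuration table could look roughly like the sketch below. It works on a raw byte buffer so that it can be tried in user space; struct mp_summary and mp_walk_entries are hypothetical names, the field offsets are taken from the structures above, and the checksum is assumed to have been verified already:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct mp_summary {
    unsigned usable_cpus;     /* processor entries with the enable bit set */
    int      bsp_apic_id;     /* Local APIC ID of the BSP, or -1 if not seen */
    uint32_t io_apic_address; /* address of the first usable IO APIC, or 0 */
};

/* Read little-endian values without assuming alignment. */
static uint16_t rd16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t rd32(const uint8_t *p) { return rd16(p) | ((uint32_t)rd16(p + 2) << 16); }

/* Returns 0 on success, -1 if the table looks malformed. */
int mp_walk_entries(const uint8_t *table, size_t size, struct mp_summary *out)
{
    if (size < 44 || memcmp(table, "PCMP", 4) != 0)
        return -1;
    size_t length = rd16(table + 4);     /* base table length in bytes */
    unsigned count = rd16(table + 34);   /* entry_count */
    if (length < 44 || length > size)
        return -1;

    out->usable_cpus = 0;
    out->bsp_apic_id = -1;
    out->io_apic_address = 0;

    const uint8_t *e = table + 44;       /* entries start after the header */
    const uint8_t *end = table + length;
    for (unsigned i = 0; i < count; i++) {
        if (e >= end)
            return -1;
        size_t entry_size = (e[0] == 0) ? 20 : 8;
        if ((size_t)(end - e) < entry_size)
            return -1;
        if (e[0] == 0 && (e[3] & 1)) {            /* usable processor */
            out->usable_cpus++;
            if (e[3] & 2)
                out->bsp_apic_id = e[1];
        } else if (e[0] == 2 && (e[3] & 1) && out->io_apic_address == 0) {
            out->io_apic_address = rd32(e + 4);   /* usable IO APIC */
        }
        e += entry_size;
    }
    return 0;
}
```

Since only processor entries are 20 bytes long, the walker has to look at the type byte of every entry to know how far to advance, even for entry types it does not care about.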

For more information, see http://www.intel.com/design/pentium/datashts/24201606.pdf, chapter 4.

Finding information using ACPI

You should be able to find a MADT table in the RSDT table or in the XSDT table. The table has a list of local APICs, the number of which should be the same as the number of logical processors in the system. Details of these tables are not listed here, but you can find them easily on this wiki.
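
As a rough illustration, the enabled local APIC entries of a MADT can be collected like this. The sketch assumes the layout from the ACPI specification (a 36 byte table header, then a 4 byte local APIC address and 4 bytes of flags, then variable-length entries, where type 0 is a processor local APIC with its APIC ID at offset 3 and an enabled flag in bit 0 of offset 4); madt_collect_lapic_ids is a hypothetical name and the table checksum is not verified here:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MADT_HEADER_SIZE 44   /* 36-byte SDT header + LAPIC address + flags */

/*
 * Store the APIC IDs of enabled type-0 entries in 'ids' (at most max_ids).
 * Returns the number of IDs stored, or -1 if the table looks malformed.
 */
int madt_collect_lapic_ids(const uint8_t *madt, size_t size,
                           uint8_t *ids, int max_ids)
{
    if (size < MADT_HEADER_SIZE || memcmp(madt, "APIC", 4) != 0)
        return -1;

    /* Total table length, little-endian, at offset 4 of the header. */
    size_t length = (size_t)madt[4] | ((size_t)madt[5] << 8) |
                    ((size_t)madt[6] << 16) | ((size_t)madt[7] << 24);
    if (length < MADT_HEADER_SIZE || length > size)
        return -1;

    int n = 0;
    size_t off = MADT_HEADER_SIZE;
    while (off + 2 <= length) {
        uint8_t type = madt[off];
        uint8_t len = madt[off + 1];      /* length of this entry in bytes */
        if (len < 2 || off + len > length)
            return -1;
        if (type == 0 && len >= 8 && (madt[off + 4] & 1) && n < max_ids)
            ids[n++] = madt[off + 3];
        off += len;
    }
    return n;
}
```

The returned count and array would correspond to the numcores and lapic_ids variables that the example code further down expects.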

AP startup

After you've gathered the information, you'll need to disable the PIC and prepare for the I/O APIC. You also need to set up the BSP's local APIC. Then, start up the APs using SIPIs.

Startup Sequence

The MP specification contains a standard method to start an AP; however, it is not recommended, as it relies on very precise timings which, if done incorrectly, can lead to several problems. Brendan offers an alternative method, which should be done for each AP individually. First send an INIT IPI and wait 10 milliseconds. Then send a SIPI, and poll for a flag to be set by the AP's trampoline code with a timeout of 1 millisecond. If the timeout was reached, send another SIPI, and poll for the same flag, but this time with a timeout of 1 second. If the AP managed to set the flag, the BSP should set another flag to allow the AP to continue (probably to wait for the scheduler to have a process it needs executing).
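
The control flow of that method can be sketched as follows. The hardware is hidden behind a hypothetical struct ap_ops so that the logic can be tried in user space; the sim_* functions are only a stand-in for a real AP, and the final "allow the AP to continue" flag is left out:

```c
#include <stdint.h>

/* Hypothetical hooks; a kernel would back these with its LAPIC and timer code. */
struct ap_ops {
    void (*send_init)(uint8_t apic_id);
    void (*send_sipi)(uint8_t apic_id, uint8_t vector);
    void (*delay_ms)(unsigned ms);
    int  (*ap_flag_set)(void);   /* nonzero once the trampoline set its flag */
};

/* Poll the flag about once per millisecond until it is set or time runs out. */
static int wait_for_flag(const struct ap_ops *ops, unsigned timeout_ms)
{
    for (unsigned waited = 0; waited < timeout_ms; waited++) {
        if (ops->ap_flag_set())
            return 1;
        ops->delay_ms(1);
    }
    return ops->ap_flag_set();
}

/* Returns 1 if the AP reported in, 0 if it never did. */
int start_ap(const struct ap_ops *ops, uint8_t apic_id, uint8_t vector)
{
    ops->send_init(apic_id);
    ops->delay_ms(10);
    ops->send_sipi(apic_id, vector);
    if (wait_for_flag(ops, 1))
        return 1;
    ops->send_sipi(apic_id, vector);   /* second attempt, much longer timeout */
    return wait_for_flag(ops, 1000);
}

/* Simulated hardware: the fake AP only reacts to SIPI number sim_wake_on_sipi. */
static int sim_sipis, sim_wake_on_sipi, sim_flag;
static void sim_init(uint8_t id) { (void)id; sim_sipis = 0; sim_flag = 0; }
static void sim_sipi(uint8_t id, uint8_t vec)
{
    (void)id; (void)vec;
    if (++sim_sipis == sim_wake_on_sipi)
        sim_flag = 1;
}
static void sim_delay(unsigned ms) { (void)ms; }
static int sim_flag_set(void) { return sim_flag; }
static const struct ap_ops sim_ops = { sim_init, sim_sipi, sim_delay, sim_flag_set };
```

With sim_wake_on_sipi set to 1 or 2, start_ap(&sim_ops, 1, 0x08) reports success after one or two SIPIs respectively; with any other value it gives up after the second SIPI.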

However, that alternative method is overcomplicated. It is much simpler to do it the other way around: just send the two SIPIs and make the APs wait for the BSP, as in the example code below.

Timing

The easiest method for the timings is to use the PIT's mode 0. Write 0x30 to IO port 0x43 (select mode 0 for counter 0), then write your count value to 0x40, LSB first (e.g. write 0xA9 then 0x04 for a millisecond, since 0x04A9 = 1193 ticks of the 1.193182 MHz PIT clock). To check if the counter has finished, write 0xE2 to IO port 0x43, then read a status byte from port 0x40. If bit 7 (the most significant bit) is set, then it has finished.
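
A minimal sketch of this timing method is shown below. The count calculation and the status check are plain C; the port accesses go through a hypothetical struct port_io, which a kernel would implement with its own outb and inb routines. Note that a 16-bit count limits a single delay to roughly 54 milliseconds:

```c
#include <stdint.h>

#define PIT_HZ 1193182u   /* input clock of the PIT, in Hz */

/* Count to program for a delay of 'us' microseconds, clamped to 1..0xFFFF. */
uint16_t pit_count_for_us(uint32_t us)
{
    uint64_t count = ((uint64_t)PIT_HZ * us) / 1000000u;
    if (count == 0)
        count = 1;
    if (count > 0xFFFF)
        count = 0xFFFF;
    return (uint16_t)count;
}

/* Bit 7 of the read-back status byte is set once the mode 0 count expired. */
int pit_status_expired(uint8_t status)
{
    return (status & 0x80) != 0;
}

/* Hypothetical port I/O hooks (not a real API). */
struct port_io {
    void    (*outb)(uint16_t port, uint8_t value);
    uint8_t (*inb)(uint16_t port);
};

/* Busy-wait for 'us' microseconds using counter 0 in mode 0. */
void pit_delay_us(const struct port_io *io, uint32_t us)
{
    uint16_t count = pit_count_for_us(us);
    io->outb(0x43, 0x30);                      /* counter 0, LSB then MSB, mode 0 */
    io->outb(0x40, (uint8_t)(count & 0xFF));   /* LSB first */
    io->outb(0x40, (uint8_t)(count >> 8));     /* then MSB */
    do {
        io->outb(0x43, 0xE2);                  /* read-back: latch status of counter 0 */
    } while (!pit_status_expired(io->inb(0x40)));
}
```

For example, pit_count_for_us(1000) gives 0x04A9, which matches the two bytes mentioned above.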

Sending IPIs

IPIs are sent through the BSP's LAPIC. Find the LAPIC base address from the MP tables or ACPI tables, then write 32-bit words to base + 0x300 and base + 0x310 to send IPIs. For an INIT IPI or startup IPI, you must first write the target LAPIC ID into the destination field of base + 0x310 (bits 24-27 on older local APICs, bits 24-31 on xAPIC). Then, for an INIT IPI, write 0x00004500 to base + 0x300. For a SIPI, write 0x00004600, ORed with the page number at which you want the AP to start executing, to base + 0x300. The write to base + 0x300 is what actually sends the IPI, so it must come last. For more information, see http://wiki.osdev.org/APIC#Local_APIC_registers.
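
The register values described here can be built with a few small helpers. This is only a sketch: the function names are hypothetical, and lapic_base is assumed to point at the (already mapped) local APIC registers:

```c
#include <stdint.h>

#define ICR_LOW   0x300u   /* writing this register sends the IPI */
#define ICR_HIGH  0x310u   /* holds the destination LAPIC ID */

/* Destination field: the target LAPIC ID goes into the top byte. */
uint32_t icr_high_for(uint8_t apic_id)
{
    return (uint32_t)apic_id << 24;
}

/* INIT IPI: delivery mode 101b (bits 8-10) plus level assert (bit 14). */
uint32_t icr_low_init(void)
{
    return 0x00004500u;
}

/*
 * SIPI: delivery mode 110b plus the 4 KiB page number of the entry point.
 * Returns 0 if the address is not page aligned or not below 1 MiB.
 */
uint32_t icr_low_sipi(uint32_t entry_phys)
{
    if ((entry_phys & 0xFFFu) != 0 || entry_phys >= 0x100000u)
        return 0;
    return 0x00004600u | (entry_phys >> 12);
}

/* Write the destination first, then the command word that triggers the IPI. */
void lapic_send_ipi(volatile uint32_t *lapic_base, uint8_t apic_id, uint32_t low)
{
    lapic_base[ICR_HIGH / 4] = icr_high_for(apic_id);
    lapic_base[ICR_LOW / 4]  = low;
}
```

For a trampoline copied to 0x8000, icr_low_sipi(0x8000) yields 0x00004608, i.e. page number 8.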

Example Code

When you start testing your SMP code on real machines, you'll realize that they do not always follow the standard. You must not use broadcast INIT IPIs or broadcast SIPIs. The following example has a decent amount of error checking to be used on real hardware. It needs three variables:

  • numcores the number of valid cores
  • lapic_ids an array of Local APIC IDs, with numcores elements
  • lapic_ptr the pointer to the Local APIC registers

You can get these from the PCMP table above, or see the example code on the MADT page.

BSP Initialization Code

volatile uint8_t aprunning = 0;  // count how many APs have started
volatile uint8_t bspdone = 0;    // spinlock flag, the APs wait until the BSP sets it
uint8_t bspid;                   // BSP id
int i, j;
// get the BSP's Local APIC ID (cpuid also overwrites eax, ecx and edx)
__asm__ __volatile__ ("mov $1, %%eax; cpuid; shrl $24, %%ebx;": "=b"(bspid) : : "eax", "ecx", "edx");

// copy the AP trampoline code to a fixed address in low conventional memory (to address 0x0800:0x0000)
memcpy((void*)0x8000, &ap_trampoline, 4096);

// for each Local APIC ID we do...
for(i = 0; i < numcores; i++) {
	// do not start BSP, that's already running this code
	if(lapic_ids[i] == bspid) continue;
	// send INIT IPI
	*((volatile uint32_t*)(lapic_ptr + 0x280)) = 0;                                                                             // clear APIC errors
	*((volatile uint32_t*)(lapic_ptr + 0x310)) = (*((volatile uint32_t*)(lapic_ptr + 0x310)) & 0x00ffffff) | ((uint32_t)lapic_ids[i] << 24); // select AP by its Local APIC ID
	*((volatile uint32_t*)(lapic_ptr + 0x300)) = (*((volatile uint32_t*)(lapic_ptr + 0x300)) & 0xfff00000) | 0x00C500;          // trigger INIT IPI
	do { __asm__ __volatile__ ("pause" : : : "memory"); }while(*((volatile uint32_t*)(lapic_ptr + 0x300)) & (1 << 12));         // wait for delivery
	*((volatile uint32_t*)(lapic_ptr + 0x310)) = (*((volatile uint32_t*)(lapic_ptr + 0x310)) & 0x00ffffff) | ((uint32_t)lapic_ids[i] << 24); // select AP by its Local APIC ID
	*((volatile uint32_t*)(lapic_ptr + 0x300)) = (*((volatile uint32_t*)(lapic_ptr + 0x300)) & 0xfff00000) | 0x008500;          // deassert
	do { __asm__ __volatile__ ("pause" : : : "memory"); }while(*((volatile uint32_t*)(lapic_ptr + 0x300)) & (1 << 12));         // wait for delivery
	mdelay(10);                                                                                                                 // wait 10 msec
	// send STARTUP IPI (twice)
	for(j = 0; j < 2; j++) {
		*((volatile uint32_t*)(lapic_ptr + 0x280)) = 0;                                                                     // clear APIC errors
		*((volatile uint32_t*)(lapic_ptr + 0x310)) = (*((volatile uint32_t*)(lapic_ptr + 0x310)) & 0x00ffffff) | ((uint32_t)lapic_ids[i] << 24); // select AP by its Local APIC ID
		*((volatile uint32_t*)(lapic_ptr + 0x300)) = (*((volatile uint32_t*)(lapic_ptr + 0x300)) & 0xfff0f800) | 0x000608;  // trigger STARTUP IPI for 0800:0000
		udelay(200);                                                                                                        // wait 200 usec
		do { __asm__ __volatile__ ("pause" : : : "memory"); }while(*((volatile uint32_t*)(lapic_ptr + 0x300)) & (1 << 12)); // wait for delivery
	}
}
// release the AP spinlocks
bspdone = 1;
// now you'll have the number of running APs in 'aprunning'

AP Initialization Code

As the application processors start up in real mode, a little Assembly is needed to enter protected mode. Modify this example to your kernel's needs.

# this code will be relocated to 0x8000, sets up environment for calling a C function
    .code16
ap_trampoline:
    cli
    cld
    ljmp    $0, $0x8040
    .align 16
_L8010_GDT_table:
    .long 0, 0
    .long 0x0000FFFF, 0x00CF9A00    # flat code
    .long 0x0000FFFF, 0x008F9200    # flat data
    .long 0x00000068, 0x00CF8900    # tss
_L8030_GDT_value:
    .word _L8030_GDT_value - _L8010_GDT_table - 1
    .long 0x8010
    .long 0, 0
    .align 64
_L8040:
    xorw    %ax, %ax
    movw    %ax, %ds
    lgdtl   0x8030
    movl    %cr0, %eax
    orl     $1, %eax
    movl    %eax, %cr0
    ljmp    $8, $0x8060
    .align 32
    .code32
_L8060:
    movw    $16, %ax
    movw    %ax, %ds
    movw    %ax, %ss
    # get our Local APIC ID
    mov     $1, %eax
    cpuid
    shrl    $24, %ebx
    movl    %ebx, %edi
    # set up 32k stack, one for each core. It is important that each core has its own stack
    shll    $15, %ebx
    movl    stack_top, %esp
    subl    %ebx, %esp
    pushl   %edi
    # fake return address, so that the C function finds its argument at 4(%esp)
    pushl   $0
    # spinlock, wait for the BSP to finish
1:  pause
    cmpb    $0, bspdone
    jz      1b
    lock    incb aprunning
    # jump into C code (should never return)
    ljmp    $8, $ap_startup

// this C code can be anywhere you want it, no relocation needed
void ap_startup(int apicid) {
	// do what you want to do on the AP
	while(1);
}
