
Other Topics

Endianness

Endianness is the order in which the bytes of a multi-byte value are laid out in memory.

Big-Endian

The most significant byte is stored at the lowest address.

// Value: 0x12345678
// Memory (address increasing →):
// 0x1000: 12 (most significant byte)
// 0x1001: 34
// 0x1002: 56
// 0x1003: 78 (least significant byte)

// "Natural" reading order
// Network protocols (TCP/IP) use big-endian

Architectures:

  • Motorola 68000
  • IBM mainframes
  • Network protocols (network byte order)
  • Java Virtual Machine

Little-Endian

The least significant byte is stored at the lowest address.

// Value: 0x12345678
// Memory (address increasing →):
// 0x1000: 78 (least significant byte)
// 0x1001: 56
// 0x1002: 34
// 0x1003: 12 (most significant byte)

// Efficient for arithmetic (start with least significant)

Architectures:

  • x86/x86-64 (Intel, AMD)
  • ARM (default, but bi-endian)
  • RISC-V (default, but can be configured)

Bi-Endian

Supports both byte orders (switchable).

Examples:

  • ARM (can switch)
  • PowerPC
  • MIPS

// ARM example
// CPSR.E bit controls data endianness (AArch32)

Conversion

#include <arpa/inet.h>

// Host to Network (big-endian)
uint32_t net32 = htonl(host32); // 32-bit
uint16_t net16 = htons(host16); // 16-bit

// Network to Host
uint32_t host32_out = ntohl(net32); // 32-bit
uint16_t host16_out = ntohs(net16); // 16-bit

// Example
uint32_t ip = 0x7F000001; // 127.0.0.1 in host byte order
uint32_t network_ip = htonl(ip); // Convert for network transmission

Detecting Endianness

int is_little_endian(void) {
    uint32_t value = 0x12345678;
    uint8_t* byte = (uint8_t*)&value;

    // First byte 0x78 → little-endian; 0x12 → big-endian
    return byte[0] == 0x78;
}

// Or using a union
int is_little_endian_union(void) {
    union {
        uint32_t i;
        uint8_t c[4];
    } test = {0x01020304};

    return test.c[0] == 0x04; // little-endian if the low byte comes first
}

Endianness Problems

// BAD: Writing binary data
FILE* f = fopen("data.bin", "wb");
uint32_t value = 0x12345678;
fwrite(&value, sizeof(value), 1, f); // Endianness-dependent!

// Reading on different endian system:
// Little-endian wrote: 78 56 34 12
// Big-endian reads: 0x78563412 (wrong!)

// GOOD: Explicit byte order
uint32_t network_value = htonl(value);
fwrite(&network_value, sizeof(network_value), 1, f);

// Or write byte by byte (big-endian regardless of host)
uint8_t bytes[4] = {
    (value >> 24) & 0xFF,
    (value >> 16) & 0xFF,
    (value >> 8) & 0xFF,
    value & 0xFF
};
fwrite(bytes, 4, 1, f);

Word Size

32-bit vs 64-bit

Memory Limits

32-bit:

Pointer size: 4 bytes (32 bits)
Address space: 2^32 = 4,294,967,296 bytes = 4 GB

User space: ~3 GB (Linux default; Windows defaults to 2 GB)
Kernel space: ~1 GB

Limitation: a single process cannot address more than 4 GB (and without PAE, total physical RAM is capped at 4 GB too)

64-bit:

Pointer size: 8 bytes (64 bits)
Address space: 2^64 = 16 exabytes (theoretical)

Actually used (x86-64):
- User space: 0x0000000000000000 - 0x00007FFFFFFFFFFF (128 TB)
- Kernel space: 0xFFFF800000000000 - 0xFFFFFFFFFFFFFFFF (128 TB)
- Total: 256 TB (48-bit addressing)

Future: 5-level paging → 57-bit → 128 PB

Data Type Sizes

// LP64 model (Linux, macOS, Unix)
// ILP32 model (32-bit Windows)

//            32-bit (ILP32)    64-bit (LP64)
// char:      1 byte            1 byte
// short:     2 bytes           2 bytes
// int:       4 bytes           4 bytes
// long:      4 bytes           8 bytes
// pointer:   4 bytes           8 bytes
// size_t:    4 bytes           8 bytes

// Windows 64-bit (LLP64)
long: 4 bytes (different!)
long long: 8 bytes

Portability Issues

// BAD: Assume pointer size
int ptr = (int)some_pointer; // Breaks on 64-bit!

// GOOD: Use appropriate types
uintptr_t ptr = (uintptr_t)some_pointer;

// BAD: Assume long is 8 bytes
long value = 0x123456789ABCDEF0; // Truncated on 32-bit Windows!

// GOOD: Use fixed-size types
int64_t value = 0x123456789ABCDEF0LL;

// BAD: Serialize pointer
fwrite(&ptr, sizeof(ptr), 1, file); // Size differs!

// GOOD: Serialize offset or ID
uint64_t offset = ptr - base_address;
fwrite(&offset, sizeof(offset), 1, file);

Performance: 32-bit vs 64-bit

64-bit advantages:

+ More registers (x86-64: 16 GPRs vs x86: 8)
+ Larger address space
+ Better performance for 64-bit arithmetic
+ More efficient calling convention

- Larger pointers (memory overhead)
- Larger cache footprint

Memory overhead:

// 32-bit
struct Node {
    int data;          // 4 bytes
    struct Node* next; // 4 bytes
};                     // Total: 8 bytes

// 64-bit
struct Node {
    int data;          // 4 bytes
                       // + 4 bytes of compiler-inserted padding
                       //   (pointer must be 8-byte aligned)
    struct Node* next; // 8 bytes
};                     // Total: 16 bytes (2× larger!)

Migration Path

32-bit only → 32-bit app on 64-bit OS → 64-bit app

x86 → x86-64 (backward compatible)
ARM32 → ARM64 (AArch64, not fully compatible)

System Calls and Mode Switching

Privilege Levels

System Call Mechanism

System Call Implementation

x86-64:

; User space
mov rax, 2 ; syscall number (open)
mov rdi, filename ; arg 1
mov rsi, O_RDONLY ; arg 2
mov rdx, 0 ; arg 3
syscall ; Enter kernel mode

; Kernel space
; rax contains syscall number
; rdi, rsi, rdx, r10, r8, r9 contain arguments

; Jump to syscall handler
call sys_call_table[rax]

; Return to user space
sysretq

x86 (32-bit, legacy):

; User space
mov eax, 5 ; syscall number (open)
mov ebx, filename ; arg 1
mov ecx, O_RDONLY ; arg 2
mov edx, 0 ; arg 3
int 0x80 ; Software interrupt (slow!)

; Kernel handles interrupt
; Returns via iret

ARM64:

; User space
mov x8, #56 ; syscall number (openat)
mov x0, filename ; arg 1
mov x1, O_RDONLY ; arg 2
svc #0 ; Supervisor call

; Kernel handles SVC exception
; Returns via eret

System Call Overhead

Direct function call: ~1-5 cycles
System call: ~100-300 cycles

Overhead includes:
- Mode switch (ring 3 → ring 0)
- Context save/restore
- Argument validation
- TLB flush (with KPTI)
- Return mode switch (ring 0 → ring 3)

With KPTI (Meltdown mitigation): +30-50% overhead

vDSO (Virtual Dynamic Shared Object)

Kernel code mapped into user space (read-only).

// Instead of syscall:
time_t t = time(NULL); // Normally a syscall

// With vDSO:
// time() reads from kernel-maintained shared memory
// No mode switch needed!
// Much faster for frequently-called syscalls

// Other vDSO functions:
// - gettimeofday()
// - clock_gettime()
// - getcpu()
# Check vDSO
cat /proc/self/maps | grep vdso
# 7fff12345000-7fff12346000 r-xp 00000000 00:00 0 [vdso]

# List vDSO functions
ldd /bin/ls | grep vdso

Context Switching

A context switch is the transition from one process/thread to another.

Context Switch Cost

// What needs to be saved/restored:

struct task_struct {
    // CPU state
    struct pt_regs regs;   // General purpose registers
    unsigned long rsp;     // Stack pointer
    unsigned long rip;     // Instruction pointer

    // FPU/SIMD state
    struct fpu fpu_state;

    // Memory management
    struct mm_struct* mm;  // Page table pointer

    // Scheduler info
    int priority;
    unsigned long runtime;

    // Many more fields...
};

Context switch overhead:

Direct cost: 5-10 microseconds
- Save/restore registers: ~1 µs
- Switch page table: ~1 µs
- TLB flush: ~2-3 µs
- Scheduler overhead: ~1-2 µs

Indirect cost:
- Cache pollution (cold cache)
- TLB misses
- Branch predictor state lost

Total effective cost: 20-100 µs

Reducing Context Switch Overhead

1. Voluntary context switches:

// Cooperative (thread yields)
sched_yield(); // Give up CPU voluntarily

// Less overhead than involuntary (timer interrupt)

2. Process affinity:

// Pin process to specific CPU
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset); // CPU 0
sched_setaffinity(getpid(), sizeof(cpuset), &cpuset);

// Benefits:
// - Warmer cache (less cache misses)
// - No TLB flush (same CPU)

3. Fewer threads:

// BAD: Too many threads
for (int i = 0; i < 1000; i++) {
    pthread_create(&threads[i], NULL, worker, NULL);
}
// Excessive context switching!

// GOOD: Thread pool (# of CPUs)
int n_threads = sysconf(_SC_NPROCESSORS_ONLN);
for (int i = 0; i < n_threads; i++) {
    pthread_create(&threads[i], NULL, worker, NULL);
}

4. User-space threading (fibers, coroutines):

// No kernel involvement
// No mode switch
// No TLB flush
// But: Can't utilize multiple CPUs

Measuring Context Switches

# Number of context switches
cat /proc/$PID/status | grep ctxt
# voluntary_ctxt_switches: 123456
# nonvoluntary_ctxt_switches: 7890

# System-wide
vmstat 1
# cs column = context switches per second

# Trace context switches
perf record -e context-switches ./program
perf report

Interrupt Latency

Interrupt latency is the time from when an interrupt occurs until it is handled.

Interrupt Latency Components

1. Interrupt recognition: 1-10 cycles
- Hardware detects interrupt signal

2. Current instruction completion: 0-100 cycles
- Finish currently executing instruction

3. Context save: 10-50 cycles
- Push PC, flags, registers

4. ISR lookup: 10-20 cycles
- Index into interrupt vector table

5. ISR execution: Variable (10-10000+ cycles)
- Actual interrupt handler code

6. Context restore: 10-50 cycles
- Pop registers, flags, PC

7. Return: 5-10 cycles
- iret/eret instruction

Total: ~100-200 cycles (typical)
+ ISR execution time

Factors Affecting Latency

Measuring Interrupt Latency

# Use cyclictest (real-time Linux)
sudo cyclictest -p 99 -t 1 -n -i 1000 -l 100000

# Output:
# Min: 5 µs
# Avg: 12 µs
# Max: 847 µs (worst case - important for real-time!)

# Trace interrupts
sudo trace-cmd record -e irq ./program
trace-cmd report

Reducing Interrupt Latency

1. Shorter critical sections:

// BAD: Long interrupts-disabled section
local_irq_disable();
// ... 1000 lines of code ...
local_irq_enable();

// GOOD: Minimal critical section
local_irq_disable();
// Only essential code
local_irq_enable();

2. Nested interrupts:

void interrupt_handler() {
    save_context();

    // Re-enable interrupts (allow higher priority)
    enable_interrupts();

    // Handle interrupt
    process_interrupt();

    disable_interrupts();
    restore_context();
}

3. Deferred work:

// Top half (fast, in ISR)
void irq_handler() {
    // Minimal work
    read_status();
    schedule_work(&workqueue);
    ack_interrupt();
}

// Bottom half (slower, in process context)
void workqueue_handler() {
    // Heavy processing
    process_data();
}

4. Real-time priority:

// Set real-time priority (Linux)
struct sched_param param;
param.sched_priority = 99; // Max priority
sched_setscheduler(0, SCHED_FIFO, &param);

// Reduce scheduling latency

5. Disable unnecessary interrupts:

# IRQ affinity (bind IRQ to specific CPU)
echo 1 > /proc/irq/IRQ_NUMBER/smp_affinity  # CPU 0 only

# Isolate CPUs for real-time tasks
# Boot parameter: isolcpus=1,2,3

CPU Frequency Scaling Impact

Impact on timing:

// Measuring execution time
clock_t start = clock(); // CPU time consumed, not wall-clock time
// ... code ...
clock_t end = clock();

// Problem: CPU frequency changes during execution,
// so "cycles per second" is not constant!
// Solution: use the TSC cycle counter (constant-rate on modern CPUs)
// or disable frequency scaling while benchmarking

Cache Line Ping-Pong

Solution: Avoid false sharing (see Performance Optimization).

Best Practices Summary

Endianness

// Always use explicit byte order for network/file I/O
uint32_t network_value = htonl(host_value);

// Use fixed-size types
#include <stdint.h>
uint32_t value; // Not "unsigned int"

Portability

// Use appropriate types for pointers
uintptr_t ptr_as_int = (uintptr_t)pointer;

// Use size_t for sizes
size_t size = sizeof(data);

// Don't assume type sizes — verify them at compile time (C11, <assert.h>)
static_assert(sizeof(long) == 8, "Expected 64-bit long");

Performance

// Minimize system calls
// Use buffered I/O
// Batch operations

// Reduce context switches
// Use thread pools
// Set CPU affinity

// Reduce interrupt latency
// Short critical sections
// Use deferred work

Debugging

# Check system info
lscpu
uname -m

# Monitor performance
perf stat ./program
vmstat 1
iostat 1

# Trace system calls
strace ./program

# Trace interrupts
trace-cmd record -e irq ./program

Related Topics

  • CPU Architecture: Privilege levels, instruction sets
  • Memory Hierarchy: Address spaces, paging
  • Performance: Context switch overhead, interrupt latency
  • I/O Systems: System calls, interrupt handling
  • Security: Mode switching, privilege separation