Performance Optimization

Performance Bottlenecks

CPU-bound vs Memory-bound

CPU-bound Workload

Heavy computation, few memory accesses.

// CPU-bound: Heavy computation
#include <math.h>

double complex_math(double x) {
    double result = x;
    for (int i = 0; i < 1000; i++) {
        result = sin(result) * cos(result) + sqrt(result);
    }
    return result;
}

Characteristics:

  • High CPU utilization (90-100%)
  • Low cache miss rate
  • Bottleneck: CPU speed, instruction throughput

Optimization:

  • Algoritmik təkmilləşdirmə
  • SIMD instructions
  • Multi-threading (see the OpenMP sketch after this list)
  • Better CPU (higher frequency, more cores)
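
As a sketch of the multi-threading option, the CPU-bound kernel above can be applied to an array in parallel with OpenMP. This is a minimal example, not from the original text: process_all() is a hypothetical driver, and it assumes the iterations are independent.

// Compile with: gcc -O2 -fopenmp
double complex_math(double x); // the CPU-bound kernel defined above

void process_all(double* data, int n) {
    // Iterations are independent, so OpenMP can split the range across cores.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        data[i] = complex_math(data[i]);
    }
}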

Memory-bound Workload

Little computation, many memory accesses.

// Memory-bound: Lots of memory access
long sum_array(int* arr, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        sum += arr[i]; // Memory access each iteration
    }
    return sum;
}

// Worse: Random access (pointer chasing)
typedef struct Node { long value; struct Node* next; } Node;

long sum_linked_list(Node* head) {
    long sum = 0;
    while (head) {
        sum += head->value; // Cache miss likely!
        head = head->next;  // Random memory access
    }
    return sum;
}

Characteristics:

  • Low CPU utilization (waiting on memory)
  • High cache miss rate
  • Bottleneck: Memory bandwidth, latency

Optimization:

  • Cache-friendly data structures
  • Prefetching
  • Memory layout optimization
  • Reduce memory access

Identifying Bottleneck

# CPU performance counters
perf stat -e cycles,instructions,cache-misses,cache-references ./program

# Output example:
# Performance counter stats for './program':
#
# 1,234,567,890 cycles
# 2,000,000,000 instructions # 1.62 insns per cycle
# 10,000,000 cache-misses # 5% of all cache refs
# 200,000,000 cache-references

Metrics:

IPC (Instructions Per Cycle):
- IPC > 2: CPU-bound (good utilization)
- IPC < 1: Memory-bound (waiting on memory)

Cache miss rate:
- < 1%: Good locality
- > 10%: Poor locality, memory-bound

Cache-Friendly Code

1. Spatial Locality

Access data sequentially (cache line friendly).

// BAD: Column-major access (cache unfriendly)
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        sum += matrix[i][j]; // Jump by row size each time
    }
}

// GOOD: Row-major access (cache friendly)
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        sum += matrix[i][j]; // Sequential access
    }
}

Performance difference:

N = 1024
Cache line size = 64 bytes = 16 ints

Row-major:
- 1 cache miss per 16 elements
- Cache miss rate: 6.25%

Column-major:
- 1 cache miss per element
- Cache miss rate: 100%

Speedup: 10-50× faster!

2. Temporal Locality

Reuse data while it's still in cache.

// BAD: Poor temporal locality
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        C[i][j] = A[i][j] + B[i][j];
    }
}
// Later...
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        D[i][j] = C[i][j] * 2; // C may not be in cache anymore
    }
}

// GOOD: Fusion - process immediately
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        C[i][j] = A[i][j] + B[i][j];
        D[i][j] = C[i][j] * 2; // C[i][j] is still in cache!
    }
}

3. Loop Tiling (Blocking)

Divide data into cache-sized chunks.

// BAD: Matrix multiplication (cache unfriendly for large N)
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

// GOOD: Tiled matrix multiplication (assumes N is a multiple of TILE)
#define TILE 32 // Fits in L1 cache
for (int ii = 0; ii < N; ii += TILE) {
    for (int jj = 0; jj < N; jj += TILE) {
        for (int kk = 0; kk < N; kk += TILE) {
            // Process a TILE×TILE block
            for (int i = ii; i < ii + TILE; i++) {
                for (int j = jj; j < jj + TILE; j++) {
                    for (int k = kk; k < kk + TILE; k++) {
                        C[i][j] += A[i][k] * B[k][j];
                    }
                }
            }
        }
    }
}
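
Why TILE = 32 is a reasonable choice (assuming 8-byte double elements and a 32 KB L1 data cache): the three active tiles together occupy roughly 3 × 32 × 32 × 8 B ≈ 24 KB, so they stay resident in L1 and each loaded element is reused TILE times before being evicted.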

Performance:

N = 1024, no tiling:     10 seconds
N = 1024, tiling: 2 seconds (5× faster!)

Data Structure Alignment

Cache Line Alignment

// BAD: Struct spans multiple cache lines
struct Data {
    int a;
    int b;
    char c;
    // Padding...
    double d; // May be in a different cache line
};

// GOOD: Explicit alignment
struct __attribute__((aligned(64))) Data {
    int a;
    int b;
    char c;
    char padding[64 - sizeof(int)*2 - sizeof(char) - sizeof(double)];
    double d;
};

False Sharing

Multiple threads accessing different variables that live in the same cache line.

// BAD: False sharing
struct Counter {
    int counter1; // Used by thread 1
    int counter2; // Used by thread 2
} counters; // Both in the same cache line!

void thread1() {
    counters.counter1++; // Invalidates the cache line for thread 2
}

void thread2() {
    counters.counter2++; // Invalidates the cache line for thread 1
}

Result: Constant cache line bouncing between cores → slow!

// GOOD: Pad to separate cache lines
struct Counter {
    int counter1;
    char padding1[60]; // Fill the rest of the cache line
    int counter2;
    char padding2[60];
} __attribute__((aligned(64)));

// Or use C11 (alignas from <stdalign.h>) / C++11
alignas(64) int counter1;
alignas(64) int counter2;

Performance:

With false sharing:    100 million ops/sec
Without false sharing: 1000 million ops/sec (10× faster!)

Array of Structures vs Structure of Arrays

// AoS (Array of Structures)
struct Particle {
    float x, y, z;    // Position
    float vx, vy, vz; // Velocity
};
struct Particle particles[1000];

// Access position
for (int i = 0; i < 1000; i++) {
    process_position(particles[i].x, particles[i].y, particles[i].z);
    // Loads the entire struct (24 bytes) even though we only need 12 bytes
}

// SoA (Structure of Arrays)
struct Particles {
    float x[1000], y[1000], z[1000];
    float vx[1000], vy[1000], vz[1000];
} particles;

// Access position
for (int i = 0; i < 1000; i++) {
    process_position(particles.x[i], particles.y[i], particles.z[i]);
    // Sequential access, better cache usage, SIMD-friendly
}

SoA advantages:

  • Better cache utilization
  • SIMD-friendly (process 4-8 elements at once)
  • Less memory bandwidth

Prefetching

Software Prefetch

#include <xmmintrin.h>  // SSE intrinsics

void process_array(int* arr, int n) {
    for (int i = 0; i < n; i++) {
        // Prefetch data we will need soon (prefetching past the end is harmless)
        _mm_prefetch((const char*)&arr[i + 64], _MM_HINT_T0); // T0 = into L1

        // Process current data
        process(arr[i]);
    }
}

Prefetch hints:

_MM_HINT_T0  // Prefetch to all cache levels (L1, L2, L3)
_MM_HINT_T1 // Prefetch to L2 and L3
_MM_HINT_T2 // Prefetch to L3 only
_MM_HINT_NTA // Non-temporal (bypass cache)

Hardware Prefetch

Modern CPUs automatically detect sequential access patterns.

// The hardware prefetcher detects this pattern
for (int i = 0; i < n; i++) {
    sum += arr[i]; // Sequential → automatic prefetch
}

// The hardware prefetcher can't detect this
for (int i = 0; i < n; i++) {
    sum += arr[random[i]]; // Random → no prefetch
}

SIMD Instructions

Scalar vs SIMD

// Scalar: Process one element at a time
void add_scalar(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i]; // 1 operation
    }
}

// SIMD: Process multiple elements at once
// (assumes n is a multiple of 8 and 32-byte aligned arrays;
//  use _mm256_loadu_ps / _mm256_storeu_ps for unaligned data)
#include <immintrin.h> // AVX

void add_simd(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]); // Load 8 floats
        __m256 vb = _mm256_load_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb); // Add 8 floats at once
        _mm256_store_ps(&c[i], vc);        // Store 8 floats
    }
}

Performance:

Scalar:  1 billion elements in 2.0 seconds
SIMD: 1 billion elements in 0.3 seconds (6-7× faster)

SIMD Instruction Sets

Common x86 SIMD extensions (register width and floats per operation):

  • SSE/SSE2: 128-bit (4 floats)
  • AVX/AVX2: 256-bit (8 floats)
  • AVX-512: 512-bit (16 floats)

ARM CPUs offer NEON (128-bit) and the scalable SVE extension.

Auto-Vectorization

Modern compilers can auto-vectorize simple loops.

void add(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

# GCC with auto-vectorization
gcc -O3 -march=native -ftree-vectorize -fopt-info-vec program.c

# Output: "loop vectorized"

Requirements for auto-vectorization:

  • Simple loop structure
  • No dependencies between iterations (e.g. no pointer aliasing; see the restrict sketch after this list)
  • Aligned data
  • Known loop bounds
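
A sketch of helping the compiler meet the "no dependencies" requirement: restrict promises that the arrays never overlap, removing an aliasing dependency that would otherwise block vectorization. The function name add_restrict is illustrative, not from the original.

// restrict: a, b, c are guaranteed not to alias each other
void add_restrict(float* restrict a, float* restrict b,
                  float* restrict c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// Verify: gcc -O3 -march=native -fopt-info-vec should report "loop vectorized"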

SIMD Example: Dot Product

// Scalar
float dot_product_scalar(float* a, float* b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

// AVX (assumes n is a multiple of 8 and 32-byte aligned inputs)
#include <immintrin.h>

float dot_product_avx(float* a, float* b, int n) {
    __m256 sum_vec = _mm256_setzero_ps(); // 8 zeros

    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]);
        __m256 vb = _mm256_load_ps(&b[i]);
        __m256 prod = _mm256_mul_ps(va, vb);
        sum_vec = _mm256_add_ps(sum_vec, prod);
    }

    // Horizontal sum of the 8 partial sums
    float result[8];
    _mm256_storeu_ps(result, sum_vec); // storeu: the stack array may not be 32-byte aligned
    float sum = 0.0f;
    for (int i = 0; i < 8; i++) {
        sum += result[i];
    }
    return sum;
}

Branch Prediction Optimization

1. Reduce Branches

// BAD: Branch in the loop
for (int i = 0; i < n; i++) {
    if (arr[i] > threshold) {
        result[i] = arr[i] * 2;
    } else {
        result[i] = arr[i];
    }
}

// GOOD: Branchless (for unpredictable data)
for (int i = 0; i < n; i++) {
    int cond = arr[i] > threshold;
    result[i] = arr[i] * (1 + cond); // Conditional move / plain arithmetic
}

2. Likely/Unlikely Hints

// GCC/Clang
if (__builtin_expect(error != 0, 0)) { // The error path is unlikely
    handle_error(error);
}

// C++20
if (error != 0) [[unlikely]] {
    handle_error(error);
}

3. Sort Data for Predictability

// BAD: Random order (unpredictable branches)
for (int i = 0; i < n; i++) {
    if (arr[i] > threshold) {
        sum1 += arr[i];
    } else {
        sum2 += arr[i];
    }
}

// GOOD: Sort first (predictable branches)
std::sort(arr, arr + n); // C++; use qsort() in C
// Now all values <= threshold come first, then all values > threshold
// → the branch predictor is happy!

Function Inlining

// BAD: Function call overhead
int add(int a, int b) {
    return a + b;
}

for (int i = 0; i < 1000000; i++) {
    result += add(x, y); // Function call on each iteration
}

// GOOD: Inline
inline int add(int a, int b) {
    return a + b;
}

// Or use a macro (old style)
#define ADD(a, b) ((a) + (b))

// The compiler usually inlines small functions automatically at -O2/-O3

Loop Optimizations

1. Loop Unrolling

// Original
for (int i = 0; i < n; i++) {
    sum += arr[i];
}

// Unrolled (4×)
int i;
for (i = 0; i < n - 3; i += 4) {
    sum += arr[i];
    sum += arr[i+1];
    sum += arr[i+2];
    sum += arr[i+3];
}
// Handle the remaining elements
for (; i < n; i++) {
    sum += arr[i];
}

Benefits:

  • Fewer loop overhead instructions
  • More ILP (instruction-level parallelism); see the multi-accumulator sketch after this list
  • Better for branch prediction
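
One caveat: the unrolled sum above still funnels every addition through the single sum variable. A sketch (not in the original) with four independent accumulators exposes more ILP; sum_unrolled4 is an illustrative name:

long sum_unrolled4(const int* arr, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += arr[i];     // Four independent dependency chains:
        s1 += arr[i + 1]; // the CPU can execute these adds in parallel
        s2 += arr[i + 2];
        s3 += arr[i + 3];
    }
    for (; i < n; i++) {
        s0 += arr[i];     // Leftover elements
    }
    return s0 + s1 + s2 + s3;
}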

2. Loop Invariant Code Motion

// BAD: Compute inside the loop
for (int i = 0; i < n; i++) {
    result[i] = arr[i] * sqrt(x) + y; // sqrt(x) doesn't change!
}

// GOOD: Hoist it out of the loop
double sqrt_x = sqrt(x);
for (int i = 0; i < n; i++) {
    result[i] = arr[i] * sqrt_x + y;
}

3. Strength Reduction

// BAD: Expensive operation (index multiplication each iteration)
for (int i = 0; i < n; i++) {
    result[i] = arr[i * stride];
}

// GOOD: Replace the multiplication with an addition (running offset)
for (int i = 0, offset = 0; i < n; i++, offset += stride) {
    result[i] = arr[offset]; // Addition instead of multiplication
}

Memory Allocation

1. Stack vs Heap

// Stack: Fast, automatic cleanup
void function() {
    int arr[1000]; // Stack allocation (~1 cycle)
    // ...
} // Automatic deallocation

// Heap: Slow, manual cleanup
void function() {
    int* arr = malloc(1000 * sizeof(int)); // Heap allocation (~100+ cycles)
    // ...
    free(arr); // Manual deallocation
}

2. Memory Pools

// BAD: Frequent malloc/free
for (int i = 0; i < 1000; i++) {
    Node* node = malloc(sizeof(Node));
    // Use node
    free(node);
}

// GOOD: Pre-allocate a pool
Node pool[1000];
for (int i = 0; i < 1000; i++) {
    Node* node = &pool[i]; // Just pointer arithmetic
    // Use node
    // No free needed
}

3. Huge Pages

// Regular pages: 4 KB
// Huge pages: 2 MB or 1 GB

// Allocate a huge page (Linux)
void* ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

// Benefit: Fewer TLB misses

TLB entries:

4 KB pages:  512 entries cover 2 MB
2 MB pages: 512 entries cover 1 GB (512× more coverage!)
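
If pre-reserved hugetlb pages aren't available, Transparent Huge Pages are an alternative. A hedged, Linux-only sketch: alloc_thp() is an illustrative helper, and the madvise call is only a hint that the kernel may ignore.

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

// Request transparent huge pages for a large, 2 MB-aligned allocation.
// size must be a multiple of 2 MB for aligned_alloc().
void* alloc_thp(size_t size) {
    void* ptr = aligned_alloc(2 * 1024 * 1024, size);
    if (ptr) {
        madvise(ptr, size, MADV_HUGEPAGE); // Hint only; the kernel decides
    }
    return ptr;
}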

Profiling and Measurement

1. Time Measurement

#include <time.h>

clock_t start = clock();
// Code to measure
clock_t end = clock();
double elapsed = (double)(end - start) / CLOCKS_PER_SEC;
printf("Time: %.3f seconds\n", elapsed);

2. perf (Linux)

# Basic profiling
perf record ./program
perf report

# CPU events
perf stat -e cycles,instructions,cache-misses,branch-misses ./program

# Sample on-CPU functions
perf record -F 99 -g ./program
perf report --stdio

# Flamegraph
perf record -F 99 -g ./program
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

3. Intel VTune

# Hotspot analysis
vtune -collect hotspots ./program

# Memory access analysis
vtune -collect memory-access ./program

# Report
vtune -report hotspots

4. Profiler Example

// Insert in critical sections
#include <stdio.h>
#include <time.h>

#define PROFILE_START(name) \
    clock_t __start_##name = clock();

#define PROFILE_END(name) \
    clock_t __end_##name = clock(); \
    printf("%s: %.3f ms\n", #name, \
           (double)(__end_##name - __start_##name) / CLOCKS_PER_SEC * 1000);

// Usage
void my_function() {
    PROFILE_START(computation)
    // ... code ...
    PROFILE_END(computation)
}

Compiler Optimizations

Optimization Levels

# GCC/Clang
-O0 # No optimization (debugging)
-O1 # Basic optimization
-O2 # Recommended (balanced)
-O3 # Aggressive (may increase code size)
-Ofast # -O3 + fast math (non-standard compliant)
-Os # Optimize for size

Important Flags

# Architecture-specific optimizations
-march=native # Use all CPU features

# Link-Time Optimization
-flto

# Profile-Guided Optimization (PGO)
# Step 1: Compile with instrumentation
gcc -fprofile-generate program.c -o program
# Step 2: Run with typical workload
./program < typical_input.txt
# Step 3: Recompile with profile data
gcc -fprofile-use program.c -o program

Best Practices Checklist

Code Level

  • Use cache-friendly data structures (arrays > linked lists)
  • Access memory sequentially (spatial locality)
  • Reuse data (temporal locality)
  • Align data to cache lines
  • Avoid false sharing in multi-threaded code
  • Use SIMD when possible
  • Reduce branches in hot loops
  • Inline small functions
  • Unroll loops (or let compiler do it)
  • Allocate on stack when possible

Compilation

  • Use -O2 or -O3 optimization level
  • Enable -march=native for CPU-specific optimizations
  • Consider LTO (-flto)
  • Try PGO for critical applications
  • Profile and analyze assembly output

Profiling

  • Measure before optimizing (don't guess!)
  • Find hotspots (80/20 rule)
  • Use profiling tools (perf, VTune, gprof)
  • Check cache miss rates
  • Monitor IPC (instructions per cycle)
  • Verify optimizations actually help

Hardware

  • Use newer CPUs (better branch prediction, larger caches)
  • Ensure adequate RAM (avoid swapping)
  • Use SSDs for I/O-bound workloads
  • Consider NUMA topology for multi-socket systems

Common Pitfalls

  1. Premature optimization

    "Premature optimization is the root of all evil" - Donald Knuth
    First: Make it work
    Then: Make it fast (if needed)
  2. Optimizing wrong part

    Profile first! Optimize the 5% that matters, not the 95% that doesn't.
  3. Micro-optimizations that don't matter

    // Wasting time on:
    i++ vs ++i // Compiler optimizes this away

    // Should focus on:
    Algorithmic complexity (e.g. O(n²) vs O(n log n))
  4. Breaking correctness

    // Don't sacrifice correctness for speed
    // Test after every optimization!
  5. Ignoring compiler warnings

    # Always compile with warnings
    gcc -Wall -Wextra -Werror program.c

Related Topics

  • CPU Architecture: Pipeline, ILP
  • Cache Memory: Cache optimization
  • Memory Hierarchy: Locality principles
  • Branch Prediction: Reducing mispredictions
  • Parallelism: Multi-threading, SIMD
  • Profiling Tools: perf, VTune