Memory read/write access efficiency

I’ve heard conflicting information from different sources, and I’m not really sure which one to believe. As such, I’ll post what I understand and ask for corrections. Let’s say I want to use a 2D matrix. There are three ways that I can do this (at least that I know of).

int i;
char **matrix;
matrix = malloc(50 * sizeof(char *));
for(i = 0; i < 50; i++)
    matrix[i] = malloc(50);

int i;
int rowSize = 50;
int pointerSize = 50 * sizeof(char *);
int dataSize = 50 * 50;
char **matrix;
matrix = malloc(dataSize + pointerSize);
char *pData = matrix + pointerSize - rowSize;
for(i = 0; i < 50; i++) {
    pData += rowSize;
    matrix[i] = pData;
}

//instead of accessing matrix[i][j] here, we would access matrix[i * 50 + j]
char *matrix = malloc(50 * 50);

In terms of memory usage, my understanding is that 3 is the most efficient, 2 is next, and 1 is least efficient, for the reasons below:

3: There is only one pointer and one allocation, and therefore, minimal overhead.

2: Once again, there is only one allocation, but there are now 51 pointers. This means there is 50 * sizeof(char *) more overhead.

1: There are 51 allocations and 51 pointers, causing the most overhead of all options.

In terms of performance, once again my understanding is that 3 is the most efficient, 2 is next, and 1 is least efficient. Reasons being:

3: Only one memory access is needed. We will have to do a multiplication and an addition as opposed to two additions (as in the case of a pointer to a pointer), but memory access is slow enough that this doesn’t matter.

2: We need two memory accesses; once to get a char *, and then to the appropriate char. Only two additions are performed here (once to get to the correct char * pointer from the original memory location, and once to get to the correct char variable from wherever the char * points to), so multiplication (which is slower than addition) is not required. However, on modern CPUs, multiplication is faster than memory access, so this point is moot.

1: Same issues as 2, but now the memory isn’t contiguous. This causes cache misses and extra page table lookups, making it the least efficient of the lot.

First and foremost: Is this correct? Second: Is there an option 4 that I am missing that would be even more efficient?

I assume that you mean the HPC environment as indicated in comment. I assume you mean also that matrix is large. As general rules

In most cases you can preallocate the data at the beginning anyway so the cost of the overhead is minimal – why bother with ms timing when your program is running for days/hours anyway running over and over this array?
In most cases you allocate large amount of data – malloc overhead for such data is minimal – why bother with few kB of overhead when you deal with GB of data
In most cases programmers are bad at timings. So the rule is to benchmark as much as possible to get the real times and overheads (though there are a few papers on modelling HPC systems you might want to read – models are extremely useful too and can be relatively simple for memory bound problems)

Regarding the specified example – it largely depend on how you access the data. In many cases you want to co-design the data structure for specific algorithm and for specific platform. In most cases you want to exploit memory hierarchy (from slowest to fastest: network, disk, main memory, cache, registers) as much as possible which means that you want to use spatial and temporal locality. Not getting into details it means that you want to:

Access data that is near data you already accessed (spatial locality)
Reuse data you already accessed (temporal locality)

That depends on your algorithm – you might for example use space filling curve if you perform cellar automata or stencil like operations (for example Z curve). The exact details depend on your platform (cache associativity, streaming auto detection etc.).

Properly exploiting it can have a dramatic speedup just when tweaked in simple way. For example matrix multiplication simply reordering loops can give speedup of several orders of magnitude (try reorder loops yourself, bonus question – if you change a size so you do not align with cache line what happens?):

const size_t size = 4096;
double (*a)[size] = malloc(sizeof(double [size]));
double (*b)[size] = malloc(sizeof(double [size]));
double (*c)[size] = malloc(sizeof(double [size]));
for (int i = 0; i < size; i++) {
    for (int j = 0; j < size; j++) {
        c[i][j] = 0;
        for (int k = 0; k < size; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Then you can further employ such techniques as loop tiling etc. Please also note that when you parallelize the code you need to avoid such things like false sharing i.e. two threads accessing data on same cache line will cause it to be sent back and forth between processors, which is slow (for Core i3/i5/i7 architecture, at least the newer ones, it downgrades the access to memory to L3 cache speed IIRC).

For more complex structure then double you also need to consider AOS vs SOA structure. The first one is ‘normal’ C/C++ array of structures and it’s good if you can exploit cache locality – when you access first element the rest is likely to be in cache too. The second one keeps each field of structure in separate array – while it has cost of worse cache locality it improves the vectorization/SPMD as the loading of single field happens in single step and improves streaming.

The good news is that for some cases, like linear operations, there are well-know libraries/APIs, such as BLAS.

Bottom line is that probably variation of option 3 is the best way. Depending on your operations you may want to pad the data to full cache line or order a data in clever way. However probably you might want to just use BLAS library from your HPC vendor if all you do is the linear operations.

Option 2-like operation are probably employed mostly when you have unstructured access.

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: softwareengineering - @ 02:57

Thẻ: c++, optimization

Thiết kế website giá rẻ

Danh mục

Memory read/write access efficiency