Graphics processing units (GPUs) are very common and allow for efficient, parallel processing of floating point numbers.
PPUs (Physics Processing Units) were a buzzword several years ago but never really caught on, and this kind of calculation is now handled by GPUs as well.
Both of these types of hardware are specialized in the processing of a specific kind of data.
Looking at the kinds of applications being developed worldwide, it seems that a great many of them have to do with text processing, particularly in web development.
It seems to me that it would make sense to implement basic string operations (concatenation, reversal, maybe search, and character-level operations such as replacement) at the hardware level, possibly making those operations orders of magnitude faster than they are when performed on a typical CPU.
Does this sort of hardware exist? If it doesn’t, what makes it infeasible? Varying length? Arbitrary encoding? I believe these issues are a bit similar to what we have in the case of floating point numbers (varying precision, different encodings possible).
I know some vendors sell devices specifically designed to handle text processing/search. The Google Search Appliance is an example thereof, but as far as I know, it’s just a normal computer with dedicated software, and the hardware does not feature anything as exotic as a processor specifically designed to crunch text.
(Disclaimer: I don’t have information on the percentages of occurrence of various elementary string operations found in common software. The following answer is just my two cents of contribution to address some of your points.)
To give a quick example, Intel provides SIMD instructions for accelerating string operations in its SSE 4.2 instruction set (see the Wikipedia article). Examples of using these instructions to build useful language-level string functions can be found on this website.
What do these instructions do?
- Given a 16-byte string fragment (either sixteen 8-bit characters or eight 16-bit characters),
- In “Equal Each” mode, it performs an exact match with another string fragment of same length.
- In “Equal Any” mode, it highlights the occurrence of characters which match a small set of characters given by another string. An example is “aeiouy”, which detects the vowels in English words.
- In “Range comparison” mode, it compares each character to one or more character ranges. An example of a character range is `azAZ`, in which the first pair of characters specifies the range of lower-case English letters, and the second pair specifies the upper-case letters.
- In “Equal ordered” mode, it performs a substring search.
(All examples above are taken from the above linked website, with some paraphrasing.)
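For concreteness, here is a minimal sketch of the “Equal Any” mode (my own example, not code from the linked website), using the `_mm_cmpistri` intrinsic with the vowel set mentioned above:

```cpp
// Build with: g++ -msse4.2 equal_any.cpp
#include <nmmintrin.h>  // SSE 4.2 string intrinsics (PCMPISTRI)
#include <cstdio>

int main() {
    // The character set to search for -- the vowel example from above.
    // Zero padding marks the end of the implicit-length fragments.
    const char set[16]  = "aeiouy";
    const char text[16] = "strength";

    __m128i needle   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(set));
    __m128i haystack = _mm_loadu_si128(reinterpret_cast<const __m128i*>(text));

    // "Equal Any" mode: return the index of the first byte of `haystack`
    // that matches any byte of `needle`; 16 means no match was found.
    int idx = _mm_cmpistri(needle, haystack,
                           _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                           _SIDD_LEAST_SIGNIFICANT);
    if (idx < 16)
        std::printf("first vowel at index %d: '%c'\n", idx, text[idx]);
    return 0;
}
```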
Before beginning a discussion on this topic, it is necessary to gather the prerequisite knowledge needed for such discussion.
It is assumed that you already have college-level introductory knowledge of:
- CPU architecture
- Digital design and synthesis, which teaches an introductory level of a hardware description language such as Verilog or VHDL.
- And finally, a practicum project where you use the above knowledge to build something (say, a simple 16-bit ALU, a multiplier, or some state-machine-based hardware logic that detects patterns in an input string), count its cost (in logic gates and silicon area), and benchmark what you built.
First of all, we must revisit the various schemes for the in-memory representation of strings.
This is because the practicum in hardware design should have taught you that a lot of things in hardware have to be “hard-wired”. Hardware can implement complex logic, but the wiring between those logic blocks is hard-wired.
My knowledge in this area is limited, but just to give a few examples: rope, “cord”, “twine”, `StringBuffer` (`StringBuilder`), etc. are all legitimate contenders for the in-memory representation of strings.
Even in C++ alone, you still have two choices: implicit-length strings (also known as null-terminated strings) and explicit-length strings (in which the length is stored in a field of the string object and updated whenever the string is modified).
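To make the distinction concrete, here is a minimal sketch of the two layouts (my own illustration, not any particular library’s):

```cpp
#include <cstddef>
#include <cstring>

// Implicit length: a '\0' terminator marks the end, so finding the
// length requires scanning the whole string -- O(n).
std::size_t implicit_length(const char* s) { return std::strlen(s); }

// Explicit length: the size lives in a field next to the data and is
// updated on every modification -- querying it is O(1).
struct ExplicitString {
    char*       data;
    std::size_t size;
};
std::size_t explicit_length(const ExplicitString& s) { return s.size; }
```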
Finally, in some languages the designer has made the decision of making string objects immutable. That is, if one wishes to modify a string, the only way to do so is to:
- Copy the string and apply the modification on-the-fly (a minimal sketch of this option follows the list), or
- Refer to substrings (slices) of the original immutable string and declare the modifications that one wishes to have applied. The modification isn’t actually evaluated until the new result is consumed by some other code.
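Here is a minimal sketch of the first option (my own illustration; `with_replaced` is a hypothetical helper, not a standard function):

```cpp
#include <string>

// Returns a modified copy; the original string is never touched,
// which is how immutable-string languages behave.
std::string with_replaced(const std::string& s, char from, char to) {
    std::string copy = s;          // copy the string...
    for (char& c : copy)
        if (c == from) c = to;     // ...and apply the modification on the copy
    return copy;
}
```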
There is also a side question of how strings are allocated in memory.
Now you can see that, in software, a lot of wonderful (or crazy) design choices exist. There has been lots of research into how to implement these various choices in hardware. (In fact, this has been a favorite way of formulating a master’s thesis for a degree in digital design.)
All this is fine, except that due to economic reasons, a hardware vendor cannot justify the cost of supporting a language/library designer’s crazy ideas about what a string should be.
A hardware vendor typically has full access to every master’s thesis written by every student (in digital design) in the world. Thus, a hardware vendor’s decision not to include such features must be well-informed.
Now let’s go back to the very basic, common-sense question: “What string operations are among the most frequently performed in typical software, and how can they benefit from hardware acceleration?”
I don’t have hard figures, but my guess is that string copying verbatim is probably the #1 operation being performed.
Is string copying already accelerated by hardware? It depends on the expected lengths of strings. If the library code knows that it is copying a string of several thousand characters or more, without modification, it could have easily converted the operation into a `memcpy`, which internally uses CPU SIMD (vectorized) instructions to perform the memory movement.
Furthermore, on these newer CPU architectures there is the choice of keeping the moved string in the CPU cache (for subsequent operations) versus keeping it out of the CPU cache (to avoid cache pollution).
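For the cache-avoidance case, here is a minimal sketch (assuming x86 with SSE2; my own illustration, not libc’s actual implementation) using non-temporal stores:

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_sfence
#include <cstddef>

// Assumes dst is 16-byte aligned and len is a multiple of 16.
void copy_bypassing_cache(char* dst, const char* src, std::size_t len) {
    for (std::size_t i = 0; i < len; i += 16) {
        __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        // Non-temporal store: write around the cache instead of into it.
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), chunk);
    }
    _mm_sfence();  // make the streaming stores globally visible
}
```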
But how often does one need to copy such long strings?
It turns out that the standard C++ library had to optimize for the other case:
- https://stackoverflow.com/questions/21694302/what-are-the-mechanics-of-short-string-optimization-in-libc
That is, strings with lengths in the low tens occur with such high frequency that special cases have to be made to minimize the overhead of memory management for these short strings. Go figure.
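As a quick experiment, here is a minimal sketch (a heuristic of my own, not the libc++ implementation) that probes whether a given `std::string` is using the inline short-string buffer:

```cpp
#include <cstdio>
#include <string>

// Heuristic: if s.data() points inside the string object itself, the
// characters are stored inline (short-string optimization, no heap use).
static bool stored_inline(const std::string& s) {
    const char* data = s.data();
    const char* obj  = reinterpret_cast<const char*>(&s);
    return data >= obj && data < obj + sizeof(std::string);
}

int main() {
    std::string short_str = "hi";
    std::string long_str(1000, 'x');
    std::printf("short string inline: %d\n", stored_inline(short_str)); // typically 1
    std::printf("long string inline:  %d\n", stored_inline(long_str));  // 0
    return 0;
}
```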
There are many ways to operate on strings in parallel. Manipulating and analyzing text data contains embarrassingly parallel problems: breaking the text down into many strings and spreading them across the threads of a multicore processor, for example, is a very obvious approach. How does Google, for example, wade through all the text it does? In parallel.
There are many string-searching algorithms (see Wikipedia) which can benefit from comparing several characters at a time. Knowing the length of a word, for example, lets you match the whole word at once if it fits in your vector register. Xeon Phi processors have, if I’m not mistaken, 4 threads per core and 57 or 60 cores on a chip, and each core has a 512-bit vector unit (Intel Xeon Phi).
Such a unit can process sixteen 32-bit integers at once, so you could even handle the widest-format Unicode characters. If your pattern fits in a vector, you can start by looking for punctuation and whitespace spaced the right distance apart, then rule out a word by checking several overlapping patterns to see whether it’s in there and gauge whether that chunk may overlap your pattern.
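To illustrate the idea of comparing several characters at a time, here is a minimal sketch (my own, using plain x86 SSE2 rather than the Xeon Phi’s 512-bit unit) that tests 16 bytes of text against 16 bytes of pattern in one instruction:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdio>

// Returns a 16-bit mask: bit i is set where text[i] == pattern[i].
int match_mask16(const char* text, const char* pattern) {
    __m128i t  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(text));
    __m128i p  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pattern));
    __m128i eq = _mm_cmpeq_epi8(t, p);   // 16 byte-comparisons at once
    return _mm_movemask_epi8(eq);        // pack the results into a bitmask
}

int main() {
    const char text[17]    = "the quick brown ";
    const char pattern[17] = "the quick brawn ";
    // Bit 12 of the mask is clear because the strings differ at index 12.
    std::printf("mask = 0x%04x\n", match_mask16(text, pattern));
    return 0;
}
```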
57 or 60 cores, times four threads, each using 512-bit vectors, is probably a very powerful tool for manipulating text in parallel if you have big chunks of text or multiple streams.