Data structure for pattern matching

Let’s say you have an input file with many entries like these:

date, ticker, open, high, low, close, <and some other values>

And you want to execute a pattern matching routine on the entries(rows) in that file, using a candlestick pattern, for example. (See, Doji) And that pattern can appear on any uniform time interval (let t = 1s, 5s, 10s, 1d, 7d, 2w, 2y, and so on…).

Say a pattern matching routine can take an arbitrary number of rows to perform an analysis and contain an arbitrary number of subpatterns. In other words, some patterns may require 4 entries to operate on others may require more or less.

Say also that the routine (may) later have to find and classify extrema (local and global maxima and minima as well as inflection points) for the ticker over a closed interval, for example, you could say that a cubic function (x^3) has the extrema on the interval [-1, 1]. (See link)

Given that I don’t know the size of the input file upon reading? What would be the most natural choice in terms of a data structure? What about an interface that conforms a Ticker object containing one row of data to a collection of Ticker so that an arbitrary pattern can be applied to the data. For example, let’s say I have a pattern that requires n elements to form a match, I need to apply the pattern to those elements and then analyze each one of those chunks.

What’s the first thing that comes to mind?

I chose a doubly-linked circular linked list that has the following methods:

push_front()
push_back()
pop_front()
pop_back()
[] //overloaded, can be used with negative parameters

But that data structure seems very clumsy, since so much pushing and popping is going on, I have to make a deep copy of the data structure before running an analysis on it.

So, I don’t know if I made my question very clear — but the main points are:

What kind of data structures should be considered when analyzing sequential data points to conform to a pattern that does NOT require random access?
What kind of data structures should be considered when classifying extrema of a set of data points?

Updated

Here’s the main analyzer loop for my program (in some pseudocode that’s a polyglot of a few languages)

data = new circular_linkedList()
// in my case, data is a circular linked list of elements.
foreach(element in dataFeed) {
    data.push_front(element)
}

step_value = Pattern.numberRequiredElementsForMatch()
buffer = new circular_linkedList()
// at this point, you'll have a circular linked list
// where pop_back() will return the oldest entry and 
// entry, and pop_front() will return the newest

// initialize the buffer
for(value in range(step_value()) {
    buffer.push_back(data.pop_back())
    // fill a buffer with the number of
    // values required for a match. 
    // so now buffer will contain
    // [elementk, ... , elementk+n]
    // where buffer[0] < ... < buffer[n]
}
do {
    last = buffer[len(buffer)]
    ...
    first = buffer[0]

    Pattern.match(first, ... , last)
    buffer.pop_front() // remove the first element
    buffer.push_back(data.pop_back()) //add another element to the back


} while (!data.isEmpty())

...

Anyway, that’s a basic idea of what I have going on. The problem is that the way I’m doing it now, I have to reiterate through the loop to apply another pattern. (buffer sizes differ) It just seems inefficient to do that. Also, what happens when there’s no more data in the buffer and it’s not evenly divisible by the number of elements needed by a match?

Vector instead of Linked List

Whatever you do, don’t use a doubly linked list for this. It has a lot of allocation and space overhead. The data is read and appended sequentially, there is no random access or random removal of items, so a vector-like data structure would be much more suitable to store data. But even that is overkill for the raw data, as you’ll see below.

Data flows through Listeners, then disappears

To do linear pattern matching you don’t have to store the data at all, and you’ll only have to traverse it once. The idea is to have several matchers listening for their patterns as the data hums along. These store only the data needed to detect a pattern. Any items that cannot be part of a pattern anymore will be forgotten.

I’ll describe one way of achieving that. I must warn you that the task you want to perform is not trivial to do efficiently. Judging from your proposal of using a linked list, it may take some time to wrap your head around the principles involved.

Continuously register Matchers listening to the data

Let’s start by adding some entities to listen for your patterns in the data. Register a matcher factory for every combination you want to recognize. Typically each pattern would be in its own matcher class, parametrized by the resolution it is looking for.

Now start reading in the data, and feed each item to each matcher factory as you read it. Then have that factory instantiate/reuse a matcher for every place that could be the starting point of a pattern. For example, a matcher with a 7 day resolution could be instantiated each time the first data point for the new week comes in.

Matchers update internal state until they reject/accept a pattern

The ticker items are also fed to each active matcher. Each matcher should track its own internal matching state. For example, a matcher with a 7 day resolution may be accumulating ticker values to calculate the 7 day average. After each 7 days pass, it stores the average in the next position of an array, starting a new average accumulation for subsequent incoming ticker items. This continues until it has seen enough weeks to either reject the or confirm the presence of the pattern. To get some ideas on how to do this, look into ‘Finite State Machines’.

Efficiency gains by eliminating duplicate calculations

Of course, if there are multiple matchers that need the data on a 7-day resolution it is not efficient to have each one calculate it on its own. You may build a hierarchy of matchers so intermediate patterns only have to be calculated once. Look into ring buffers for ideas on how to maintain rolling averages (or other aggregate functions)

Related: Parser Generators

So-called ‘Parser Generators’ do a similar thing automatically for formal grammars. The generated parsers employ a finite state machine to detect hundreds of patterns with about the same effort it would take to recognize just one, and in just one pass of the source data. I imagine such tools may also exist for continuous time-series data, or you could transform their ideas to apply them to your problem.

Hope this helps!

Since you said you’re reading the data from a file, my first choice for the underlying data structure would be an array. It’s the most compact form of storage, which has advantages with cache, and arrays work very well when your data size is fixed and known ahead of time. Don’t get thrown off by not “needing” random access. Random access data structures can easily be substituted for sequential-only data structures, even though the reverse isn’t true.

I would also recommend some sort of View object into the array, which provides an interface that better matches your needs. It can handle internally the details of the starting point of the pattern, the time interval, etc., so your pattern matching functions can have a very simple interface, something like:

bool pattern(View &data) {
     return data[1].high > data[0].high;
}

Your pattern matching function doesn’t care if data[0] and data[1] are one second or one year apart. It doesn’t care if data[0] is the first row in your file or the millionth. The View object abstracts those details away. You just instantiate different View objects as needed, but they all share the same underlying array.

If the extrema are used as part of the pattern matching, you can calculate them once and store it in the View object, and use them like this:

bool pattern(View &data) {
     return (data.max() - data.min()) > 1.0;
}

In summary, the idea is to decouple your data structure as much as possible from the matching functions. You want those functions to read as close as possible to how a domain expert would express them in English. Note I can easily just change the View object, if I need to match patterns streaming in on a socket instead of from a file. The “meat” of the pattern matching doesn’t need to change.

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: softwareengineering - @ 21:12

Thẻ: algorithms, data-structures, pattern-matching

Thiết kế website giá rẻ

Danh mục