Regular Expressions and Language Generators

I have a few CS textbooks with me that discuss languages (well, actually two, plus old course notes supplied a few years ago). I have been searching the web too, and only seem to come up with vague responses, just like the textbooks I have.

My question is about language recognisers versus generators.

I get the overriding principle of a recogniser: it analyses a language and is able to determine, yea or nay, whether a string belongs to that language. This is at least what I have picked up from the books and notes. However, it’s much more complex than that, is it not? A tokeniser and syntax analyser (which I assume to be recognisers) do not just say yes or no; they also say where, don’t they?

However, language generators: no one seems to be very clear about what they are. The typical description I get is vague. For example, Sebesta’s Concepts of Programming Languages says ‘A language generator is a device that can be used to generate the sentences of a language. We can think of a generator as a push button that produces a sentence of a language every time it is pressed.’ Seriously? That’s it? You’re kidding, right?

I read that a regex is an example of a generator, so why, when people talk of generators, do they not talk of the inputs? For example, a regex has a target string, and the regex itself defines both the accepted alphabet and its grammar rules.

Can someone provide me with a clearer distinction between a recognizer and a generator?


I’m surprised you didn’t find your answer in Sebesta’s Concepts of Programming Languages… In short:

Recogniser: sentences -> grammar. Application: to check if given sentences belong to the language defined by the grammar. This is an essential part of a compiler or interpreter.

Generator: grammar -> sentences. Application: to provide examples of valid sentences in the language. This is an essential part of a programming language manual.

For instance, the user manual for a certain programming language may explain in words what a valid assignment looks like, and even provide a BNF rule (something like: assign -> var = expr), but it will also include examples of valid assignments so that the programmer understands better (e.g., a = b or x = y + 2*z). Such examples are generated from the BNF through a process known as derivation.
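The derivation process can be sketched in a few lines of code. The following Python sketch (the rule encoding and function names are my own, not from Sebesta) repeatedly expands nonterminals of the assignment grammar above until only terminals remain:

```python
import random

# BNF-like rules for the assignment example (assumed encoding):
# each nonterminal maps to a list of alternative productions.
GRAMMAR = {
    "assign": [["var", "=", "expr"]],
    "var":    [["a"], ["b"], ["x"], ["y"]],
    "expr":   [["var"], ["var", "+", "var"]],
}

def derive(symbol, rng):
    """Derive a sentence by recursively expanding nonterminals."""
    if symbol not in GRAMMAR:           # terminal: emit as-is
        return symbol
    production = rng.choice(GRAMMAR[symbol])
    return " ".join(derive(s, rng) for s in production)

print(derive("assign", random.Random(0)))  # e.g. a sentence like "x = a + b"
```

Each call walks one derivation from the start symbol to a sentence, which is exactly the "push button" behaviour described in the book.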


First off, these are not difficult concepts in their general sense, so do not be surprised that "generating sentences of a language" really is all there is to it.

There is one concept shared by both a language recognizer and a language generator, and that is a language.

Languages can be defined as simply as L = {a, b, c}, where the language L contains only three words/sentences. Regular expressions themselves are neither recognizers nor generators. Instead, a regular expression is a way of defining a language. For example, the regular expression a+ defines the language L' = {a, aa, aaa, aaaa, ... }.
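To make this concrete, here is a minimal sketch (function name assumed) of a generator for the language L' defined by a+, written as a Python generator that produces one sentence per "button press" (each `next()` call):

```python
from itertools import count

def gen_a_plus():
    """Yield the sentences of L' = {a, aa, aaa, ...} one at a time."""
    for n in count(1):
        yield "a" * n

g = gen_a_plus()
print(next(g))  # a
print(next(g))  # aa
```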

Now if you look into those textbooks of yours, I am sure there is a section on how to convert a regular expression into a (non-)deterministic finite automaton. Such an automaton is a language recognizer: when constructed from the regular expression a+, it rejects b but accepts aaa.
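The corresponding recognizer is the two-state automaton for a+. A hand-coded sketch (state encoding assumed):

```python
def accepts_a_plus(word):
    """Simulate the DFA for a+: start state 0, accepting state 1."""
    state = 0
    for ch in word:
        if ch == "a":
            state = 1
        else:
            return False  # symbol with no transition: reject
    return state == 1

assert accepts_a_plus("aaa")       # accepted
assert not accepts_a_plus("b")     # rejected
assert not accepts_a_plus("")      # the empty word is not in a+
```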

On the other hand, a language generator is something that outputs one element of L' at a time, eventually generating all of L'. For example, a generator for L' could give you aaa, then a, then aa, then aaaa, and so on.

Generally, languages are infinitely large; hence, generators need to generate an infinite number of sentences in the language. Consider the language L1 defined by the regular expression a+b+. If you have a generator that outputs ab, aab, aaab, ..., then it can run infinitely long, but you will never see a word with more than one b. Such generators tend to be rather useless due to their biased nature.

Therefore, to reduce the “that’s it” effect, we typically demand of a generator that, given any word from the language, it eventually produces that word. So you could pick aabb, and your generator must produce it within a finite number of steps. It cannot just keep increasing the number of as forever.
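A fair generator for a+b+ can be obtained by enumerating words in order of total length, so that any given word, such as aabb, is reached after finitely many steps. A sketch (names assumed):

```python
from itertools import count, islice

def gen_a_plus_b_plus():
    """Enumerate a^i b^j (i, j >= 1) by increasing total length i + j."""
    for total in count(2):
        for i in range(1, total):
            yield "a" * i + "b" * (total - i)

# The enumeration starts: ab, abb, aab, abbb, aabb, aaab, ...
# so aabb appears after only a handful of steps.
assert "aabb" in list(islice(gen_a_plus_b_plus(), 6))
```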

Now, to start making things really complicated, assume a more interesting language like, for example, Java. When you have a Java language generator, then this thing will eventually produce all of the Java classes that you have written and will ever write in your life as a Java programmer. I leave it as an exercise to you to think about how to realize such a thing – and do not forget: only valid Java programs are allowed. It must not output gibberish, but only sentences of the language.

Usage of language generators is primarily in CS theory. For practical purposes, their infinite nature makes any kind of meaningful usage rather problematic. Apart from pathological cases, where the language is trivial and finite, one could use these generators for testing tools that are supposed to process sentences of the language. For example, you could test a Java parser against a Java generator by having it continuously parse the generated programs.
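As a toy instance of that testing idea (using a+b+ in place of Java, and Python's re module as a stand-in for the independently written parser), one can check that every generated sentence is accepted:

```python
import re
from itertools import count, islice

def gen_a_plus_b_plus():
    # assumed helper: fair generator for a+b+, enumerating by total length
    for total in count(2):
        for i in range(1, total):
            yield "a" * i + "b" * (total - i)

parser = re.compile(r"a+b+")  # stand-in for the independently written parser
for sentence in islice(gen_a_plus_b_plus(), 100):
    assert parser.fullmatch(sentence), sentence
```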


There are actually three classes of automata that (mostly) get lumped together, much in the way you described: Generators, Recognizers and Transducers. They can all be accounted for in a very general framework that ties directly into Monoidal Categories. Each of the basic properties of a monoidal category yields a parallel that has something important to say about how these three types of automata are related to one another, including one property (“The Duality Theorem”) that, as best as I’m aware, has no coverage in the literature (yet). The Duality Theorem, in turn, provides a somewhat indirect way to swap the roles of input and output, as well as a way to swap the roles of Generator and Recognizer, so that you can more or less treat them interchangeably.

Directly addressing your question, examples of language generators include the output generators used in the POSIX-standard utilities yacc, lex and indent, as well as the back ends of any compiler. Libraries for generating output back ends include LLVM and libfirm (on GitHub). I’m not as familiar with natural language generators, though the best-known ones that have cropped up in the past few years include the Generative Pre-trained Transformers (GPT) and LLaMA, which are both Large Language Models (LLMs).

The LLMs are not grammar-based, but entirely statistically based, with the non-linear statistical models encapsulated in the form of neural networks. That is not saying much more than “non-linear statistical model”, since classical non-linear models (such as non-linear regression or radial basis functions) already tend to resemble neural nets in terms of their structure.

I’m not (yet) aware of what research fuses statistical / neural net based language generators with symbolic / grammar / logic based language generators, to synergize between the two, though such research almost certainly exists.

Transducers Versus Recognizers Versus Generators
The most general class is that of transducers, which includes all translators such as compilers, utilities such as yacc and lex (which, technically, are compilers themselves), and language processors like indent. A transducer is associated with an input alphabet X and an output alphabet Y. The elements of the alphabets could be symbolic or (particularly in the case of the output alphabet) kinetic, with the symbols standing for actions. So transducers actually include player pianos, drones that operate under guidance and other guided embedded systems.

In the special case where Y = ∅, the transducer is input-only and corresponds to a … Recognizer. For the special case where X = ∅, it is output-only and corresponds to a … Generator.

This generalizes to the case where there are three or more channels/tiers, each with its own alphabet, including cases where the input and output channels are, themselves, structured. Ultimately, that leads to what I referred to, above, as the “Duality Theorem”.

Product Monoids
The alphabets X and Y yield word algebras X* and Y*, each endowed with a product ((a, b) ↦ ab) representing “concatenation” and an identity (1) representing the “empty word”, satisfying the properties

    1 x = x = x 1,  x (y z) = (x y) z

Such an algebra is called a Monoid, and the particular cases mentioned, M = X* and N = Y* are Free Monoids by virtue of the fact that the words in the algebras X* and Y* satisfy no properties other than those given by or derived from the basic monoid properties.

The example of a non-free monoid of most direct relevance here is the product M × N of two monoids M and N. Classically, transductions are characterized as sets of ordered pairs between two free monoids M = X* and N = Y*, the former providing the set of possible inputs, the latter the set of possible outputs. The subsets of M × N give us descriptions of transductions between two monoids M and N. The restriction to direct products of free monoids, with the rational, context-free and Turing subsets, yields, respectively, the rational transductions, push-down transductions (equivalently, simple syntax directed translations) and Turing transductions (equivalently, recursively enumerable relations between X* and Y*).

Formally, the direct product M × N of monoids M and N is formed as the Cartesian product M × N = { (m, n): m ∈ M, n ∈ N}, with multiplication defined pointwise:

    (m, n) (m̅, n̅) = (m m̅, n n̅), for m, m̅ ∈ M, n, n̅ ∈ N.

It follows that

    (m, 1) (1, n) = (m, n) = (1, n) (m, 1), for m ∈ M, n ∈ N.

Therefore, the direct product M × N may be characterized as the common extension of the factor monoids M and N in which the elements of each commute with those of the other. In particular

    X* × Y* = X* ⊎ Y*, subject to the relations (x y = y x, for x ∈ X, y ∈ Y).

where ⊎ denotes “disjoint union”. (Therefore, X* × Y* is not a free monoid.)

A disjoint union can be effected by tagging each m ∈ M as (m, 1), and each n ∈ N as (1, n). So, more precisely, we have two morphisms:

    ⊥: m ∈ M ↦ (m, 1) ∈ M × N,      ⊤: n ∈ N ↦ (1, n) ∈ M × N,

that are mutually commuting

    ⊥(m) ⊤(n) = ⊤(n) ⊥(m), for m ∈ M, n ∈ N.
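For free monoids represented as Python strings, the product monoid and the commuting inclusions can be checked directly (the representation and names here are my own):

```python
# Elements of X* × Y* as pairs of strings; multiplication is pointwise
# concatenation, and the identity is the pair of empty words.
def mul(p, q):
    return (p[0] + q[0], p[1] + q[1])

def bot(m):           # ⊥: m ↦ (m, 1)
    return (m, "")

def top(n):           # ⊤: n ↦ (1, n)
    return ("", n)

# ⊥(m) and ⊤(n) commute, and their product recovers (m, n):
m, n = "ab", "xyz"
assert mul(bot(m), top(n)) == mul(top(n), bot(m)) == (m, n)
```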

As a reminder: a [monoid homo-]morphism φ: M₀ → M₁ between monoids M₀ and M₁ is a map that preserves the monoid structure, i.e. φ(1) = 1 and φ(m₀ m₁) = φ(m₀) φ(m₁), for m₀ ∈ M₀ and m₁ ∈ M₁.

Associated with the monoid product is the “universal property”, which associates with any mutually commuting morphisms

    f: M → P, g: N → P

to a monoid P, a unique constructor morphism

    [f, g]: (m, n) ∈ M × N ↦  f(m) g(n) = g(n) f(m) ∈ P

that satisfies the reconstruction identities [f, g]◦⊥ = f and [f, g]◦⊤ = g. The uniqueness property is equivalently captured by the uniqueness identities h◦[f, g] = [h◦f, h◦g] and [⊥, ⊤] = 1_{M×N}, for any monoid morphism h: M × N → P. For the following, the maps ⊥ and ⊤ can be treated as just inclusions, and the product ⊥(m) ⊤(n) = ⊤(n) ⊥(m) as just the (mutually commuting) product m n = n m, with M × N being the simultaneous extension of the two factor algebras M and N.
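The universal property can be illustrated with a commutative target monoid P (so that any two morphisms into it automatically commute), say the integers under addition; all names below are my own:

```python
# M = N = strings under concatenation; P = (int, +, 0).
def f(m):             # morphism M -> P: word length
    return len(m)

def g(n):             # morphism N -> P: twice the word length
    return 2 * len(n)

def fg(pair):         # the constructor morphism [f, g]: M × N -> P
    m, n = pair
    return f(m) + g(n)

# [f, g] preserves the identity and products, and reconstructs f and g
# via the inclusions ⊥: m ↦ (m, 1) and ⊤: n ↦ (1, n):
assert fg(("", "")) == 0
assert fg(("ab", "x")) == fg(("ab", "")) + fg(("", "x"))
assert fg(("ab", "")) == f("ab") and fg(("", "x")) == g("x")
```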

Monoids As A Tensor Category
The monoids, taken in their entirety, along with the morphisms between them, together form a Category: the category of monoids. The monoid of particular interest here, which allows you to lump Generators and Recognizers together with Transducers, is the trivial monoid 𝟙 = {0}, where the element 0 ∈ 𝟙 plays the role of the identity “1”. Its main property is that it has a unique morphism to every other monoid:

    [M]: 0 ∈ 𝟙 ↦  1 ∈ M

The monoid 𝟙 is referred to as an “initial object” in the category of monoids, by virtue of this property. The uniqueness is equivalently given by the properties that f◦[M] = [N], for any morphism f: M → N, and that [𝟙] = I_𝟙 is the identity map on 𝟙.

In addition, the product operator × can be applied to morphisms so that from

    f: M → M',      g: N → N'

one may define

    f × g = [⊥◦f, ⊤◦g]: M × N → M' × N'.

The combination of the product constructions on monoids, and on morphisms, along with an initial object yields what is known as a Braided Monoidal (or Tensor) Category. Its chief properties are that it possesses the following isomorphisms

    Right Unitor: ρ_M: M × 𝟙 → M
    Left Unitor: λ_N: 𝟙 × N → N
    Associator: α_{MNP}: M × (N × P) → (M × N) × P
    Commutor: γ_{MN}: M × N → N × M

which can be readily constructed, along with their inverses, from ingredients already laid out, e.g.

    ρ_M ≡ [1_M, [M]]: M × 𝟙 → M,    (ρ_M)⁻¹ ≡ ⊥: M → M × 𝟙
    λ_N ≡ [[N], 1_N]: 𝟙 × N → N,    (λ_N)⁻¹ ≡ ⊤: N → 𝟙 × N

with other constructions, similarly composed from the basic ingredients, for the commutors, associators and their inverses.

For clarification, the term “monoidal” in the name “monoidal category” is not directly linked to the category we’re considering here (the category of monoids), but is a separate use of the term “monoid” and applies broadly to a large range of other categories, besides the category of monoids.

Additional requirements are also readily verified, but will only be noted in passing, since they will not be used in the following. These are: that the isomorphisms be “natural”, e.g. f◦λ_N = λ_N’◦(1_𝟙 × f), for morphisms f: N → N’, and g◦ρ_M = ρ_M’◦(g × 1_𝟙), for morphisms g: M → M’; and that the category be “coherent”, meaning that any two morphisms constructed out of ⊥, ⊤ and [,] with compositions and identities are equal if they have the same source and target. That’s the property that the “hexagon identities” diagrams are meant to capture.

The braided monoidal categories are distinguished from the more general family of monoidal categories, linked previously up above, which need not have the “commutor” nor any of its associated identities. The link to monoidal categories has the “Pentagon Diagram” and “Triangle Diagram” to capture the “coherence” property.

A General Framework For Transducers, Recognizers And Generators
A transducer T = (Q, I, F, H) over a monoid M × N consists of a state set Q, subsets I, F ⊆ Q respectively of initial and final states and a (possibly infinite) subset H ⊆ Q × M × N × Q. The relation → (or →_H to make H explicit) is defined on N × Q × M as the closure of the one-step relations q m → n q’ for (q, m, n, q’) ∈ H under applications of the following

    R:      If α ∈ N × Q × M, then α → α
    T:      If α → β and β → γ, then α → γ
    C:      If α → β and (m, n) ∈ M × N, then n α m → n β m

The subset of M × N corresponding to the transducer is that given by

    L(T) = { (m, n) ∈ M × N: (∃i ∈ I)(∃f ∈ F) i m → n f }

For the case N = 𝟙, the transducer reduces to a Recognizer for the monoid M, and we can separately define a “Recognizer” by removing N from the picture, taking:

    S = (Q, I, F, H) over a monoid M
    H ⊆ Q × M × Q
    → is defined on Q × M
    q m → q', for (q, m, q') ∈ H are the 1-step relations
    R:      If α ∈ Q × M, then α → α
    T:      If α → β and β → γ, then α → γ
    C:      If α → β and m ∈ M, then α m → β m
    L(S) = { m ∈ M: (∃i ∈ I)(∃f ∈ F) i m → f }

For the case M = 𝟙, the transducer reduces to a Generator for the monoid N, and we can separately define “Generators” by removing M from the picture, taking:

    U = (Q, I, F, H) over a monoid N
    H ⊆ Q × N × Q
    → is defined on N × Q
    q → n q', for (q, n, q') ∈ H are the 1-step relations
    R:      If α ∈ N × Q, then α → α
    T:      If α → β and β → γ, then α → γ
    C:      If α → β and n ∈ N, then n α → n β
    L(U) = { n ∈ N: (∃i ∈ I)(∃f ∈ F) i → n f }

with the other defining properties being unchanged.
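The three definitions can be exercised with a bounded simulation over free monoids (all names here are my own; since the languages involved are generally infinite, only a finite prefix of the relation is enumerated):

```python
def transduce(I, F, H, max_steps):
    """Enumerate pairs (m, n) with i m -> n f for some i in I, f in F,
    using at most max_steps one-step transitions, where H is a finite
    set of quadruples (q, m, n, q') over free monoids of strings."""
    results = set()
    frontier = {(q, "", "") for q in I}
    for _ in range(max_steps + 1):
        results |= {(m, n) for (q, m, n) in frontier if q in F}
        frontier = {(q1, m + dm, n + dn)
                    for (q, m, n) in frontier
                    for (q0, dm, dn, q1) in H if q0 == q}
    return results

# A transducer translating a^k to b^k (k >= 1):
H = {(0, "a", "b", 1), (1, "a", "b", 1)}
assert ("aa", "bb") in transduce({0}, {1}, H, 3)

# Recognizer: trivial output monoid (always emit the empty word).
H_rec = {(q, m, "", q1) for (q, m, n, q1) in H}
assert ("aa", "") in transduce({0}, {1}, H_rec, 3)

# Generator: trivial input monoid (consume nothing, only emit).
H_gen = {(q, "", n, q1) for (q, m, n, q1) in H}
assert ("", "bbb") in transduce({0}, {1}, H_gen, 3)
```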

The right unitors ρ_M: M × 𝟙 → M express the correspondences between transducers with trivial outputs N = 𝟙 and recognizers, while the left unitors λ_N: 𝟙 × N → N express the correspondences between transducers with trivial inputs M = 𝟙 and generators.

The Duality Theorem
A transducer may be equivalently expressed as a recognizer or generator, where the one-step transition qm →_H nq’, is replaced respectively by the read relation q (m, n) →_H^R q’ and write relation q →_H^W (m, n) q’. These conversions correspond to the isomorphisms

    (M × N) × 𝟙 ≅ M × (N × 𝟙) ≅ M × N ≅ (𝟙 × M) × N ≅ 𝟙 × (M × N).

Notice, by the way, that the commutors are not used anywhere, so we actually only need the structure of monoidal categories, here, rather than braided monoidal categories. Taking advantage of the monoidal category structure formed by the monoids, instead of proving the above results directly, we can prove a more general result: the one corresponding to the isomorphism (M × N) × P ≅ M × (N × P).

Duality Theorem
Let M, N, P be monoids and T₀ = (Q, I, F, H₀) and T₁ = (Q, I, F, H₁) be transducers, respectively, over (M × N) × P and M × (N × P) such that (q, (m, n), p, q’) ∈ H₀ if and only if (q, m, (n, p), q’) ∈ H₁. Then

    q (m, n₀) →_{H₀} p' q' if and only if q m →_{H₁} (n₀, p') q',
    L(T₀) = α_{MNP}(L(T₁)).

Proof:
For the one-step transitions and for R, the assertion is trivial.

Therefore, we need only establish the result inductively for T and the restricted forms of C.

For T, suppose

    p q (m, n) →_{H₀} p" q" (m", n") →_{H₀} p' q' (m', n').

Then applying the inductive hypothesis to each step and applying T, we factor n = n₀ n”, n” = n₁ n’ and write

    (1, p) q m →_{H₁} (n₀, p") q" m",       (1, p") q" m" →_{H₁} (n₁, p') q' m'.

Multiplying the second transition on the left by n₀ (i.e. applying C), and using T, we get

    (1, p) q m →_{H₁} (n₀ n₁, p') q' m',    n = (n₀ n₁) n'.

For C, suppose

    p̅ p q (m m̅, n n̅) →_{H₀} p̅ p' q' (m' m̅, n' n̅)

arises from

    p q (m, n) →_{H₀} p' q' (m', n'),       m̅ ∈ M, n̅ ∈ N, p̅ ∈ P.

Again, applying the inductive step, we have a factoring

    n = n₀ n',      (1, p) q m →_{H₁} (n₀, p') q' m'.

Applying ???? to this, we obtain the result

    (1, p̅ p) q m m̅ →_{H₁} (n₀, p̅ p') q' m' m̅,       n n̅ = n₀ (n' n̅),

thus rounding out the induction.

A Broader Framework For Multi-Channel Communications
Because of the Duality Theorem, we can treat a transducer from monoid M to monoid N as a recognizer or generator for the product monoid M × N … or vice versa.

For the more general result, given by the Duality Theorem, corresponding to the isomorphism

    (M × N) × P ≅ M × (N × P)

between monoids M, N and P, the interpretation that most naturally avails itself is to treat this as a dual view of communications between two agents, where the M “channel” corresponds to the internal channel of the sender, the P “channel” corresponds to the internal channel of the receiver, and the N “channel” is the shared communications link between the two. The transducer, itself, yields a description of the communications as seen from an Archimedean Vantage Point.

The first product (M × N) × P is the transducer from the perspective of the sender, where the communications channel is combined with the sender channel, which are both visible to the sender, while the receiver channel is viewed externally as output. The second product M × (N × P) is the transducer from the perspective of the receiver, with the communications channel combined with the receiver channel, both of which are visible to the receiver, while the sender channel is viewed externally as input.

Since this framework applies generically to arbitrary monoids, not just free monoids, more complex structures can be built up from this, e.g. networks with three or more agents, four or more channels. All of this is seamlessly woven together into a single framework, as laid out here.

Here, I describe a framework for Context-Free Grammars For Arbitrary Monoids in the process of addressing what “context-free” means. The cases of most immediate interest are the context-free subsets of the product monoid X* × Y*, which are equivalently described as push-down transductions between X and Y, and simple syntax directed translations between X and Y. This is the framework in which parsers and parsing theory ultimately reside.
