How to use HuggingFace’s Transformers.js to distill messy dictionary definitions down to a clean array of 1-3 word definitions?

Background

What pieces would be involved in using Transformers.js to distill/summarize/clean dictionary definitions that are messy and full of “junk”, and to return a JSON array of short, summarized definitions?

For example, here are nine Tibetan terms which I sent to the OpenAI model gpt-4o in Node.js:

{
  "ལན་ཆགས།": [
    "[2759] 1) gnod lan res 'khor gyi rgyun gnas pa/ ... srog gi lan chags/ ... tshe sngon gyi lan chags/ ... 2) (yul) 1/dka' tshegs/ ... 2/chags sgo / ...",
    "a particular type of generalized karma which designates a relationship across lives in which the roles of the parties are reversed (a former master becomes the servant to his former servant, etc.).  May also refer to the entities that seek retribution; divided into sha 'khon (those who bear malice towards our flesh) and rgyu 'khon (those who bear malice toward our property).  Epstein, Dissertation 81, 106.  Tucci, Religions 181.  bu tsha lan chags snyog pa la byams par byas kyang mgu' ba myi 'ong.  Zhi-byed Coll. II 304.7.",
    "retribution (answer indebtedness), karmic debts due to past actions, payment",
    "1) retribution gnod lan res 'khor gyi rgyun gnas pa; 2) misfortune, adversity, calamity; 3) door",
    "{lan chags kyi mgron} (guests who are) karmic creditors",
    "karmic debts incurred by. karmic creditor; karmic credit/ retribution/ a debt; disaster caused by karmic retribution/ fate; misfortune, adversity, calamity, retribution"
  ],
  "དག་པ་འཇིག་རྟེན་པའི་ཡེ་ཤེས": [
    "A term used by ZAntipa in his Hevajra sAdhana, for the ye shes in post-meditation phase, rjes thob (aftereffect of the AvikalpajñAna in meditative equipoise, mnyam gzhag).  Skt. zuddhalaukikajñAna."
  ],
  "སྨན་བཅོས་སྐོར་བསྐྱོད": [
    "mobile med. care"
  ],
  "སྨན་བཅོས་ཁང༌": [
    "clinic",
    "*, dr.s consulting room"
  ],
  "སྨན་བཅོས་འཕྲོད་བསྟེན": [
    "public health, health care"
  ],
  "སྨན་བཅོས་འཕྲོད་བསྟེན་སྡེ་ཁག": [
    "health clinics"
  ],
  "སྨན་བཅོས་བྱེད": [
    "cure, remedy, treat"
  ],
  "སྨན་བཅོས་བྱེད་པ": [
    "cure, remedy, treat",
    "*, examine a patient",
    "to *"
  ],
  "སྨན་བཅོས་མི་སྣ": [
    "med. personnel"
  ]
}

Notice how each array item is messy, uses abbreviations, or contains non-English “junk” like {lan chags kyi mgron} (guests who are) karmic creditors does. I asked OpenAI to simplify and distill the definitions to lowercased 1-3 word terms/phrases (where possible; otherwise the definitions can be longer). I didn’t pass the Tibetan term, so it wouldn’t try to define it on its own. Here is basically what I sent to the OpenAI API:

const completion = await openai.chat.completions.create({
  messages: [
    {
      role: 'system',
      content: 'You are a helpful text summarizer.',
    },
    {
      role: 'user',
      content:
        'Summarize this set of definitions into a set of 1-3 word definitions, removing or merging duplicate or similar definitions where applicable. Omit direct wylie text or otherwise meaningless text. It\'s okay if the definition is longer than 3 words if it can\'t easily be shortened. Send the definitions as a JSON array of strings under the "definitions" key, and send the proposed part-of-speech under the "type" field, formatted in lowercase unless it is a proper name, and don\'t use abbreviations where they can be easily expanded to the normal word. Put the simplest 1-3 word definition first. That\'s it: ' +
        instructions,
    },
  ],
  model: 'gpt-4o',
})

Here, instructions was the raw JSON array of definitions from above, e.g.:

const instructions = [
  "cure, remedy, treat",
  "*, examine a patient",
  "to *"
]
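A subtle gotcha I ran into: if you concatenate that array straight onto the prompt string, JavaScript coerces it with Array.prototype.toString(), which comma-joins the items and drops the quotes (in my full script below I join with newlines instead). A minimal illustration:

```javascript
const instructions = ['cure, remedy, treat', '*, examine a patient', 'to *']

// Naive concatenation coerces the array via toString(): comma-joined, no quotes
const naive = "That's it: " + instructions
// → "That's it: cure, remedy, treat,*, examine a patient,to *"

// Explicit serialization keeps the array structure visible to the model
const serialized = "That's it: " + JSON.stringify(instructions)
// → "That's it: [\"cure, remedy, treat\",\"*, examine a patient\",\"to *\"]"
```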

I ran that for each term/definitions pair. Here is the prompt in isolation, for easier reading:

Summarize this set of definitions into a set of 1-3 word definitions, removing or merging duplicate or similar definitions where applicable. Omit direct wylie text or otherwise meaningless text. It’s okay if the definition is longer than 3 words if it can’t easily be shortened. Send the definitions as a JSON array of strings under the “definitions” key, and send the proposed part-of-speech under the “type” field, formatted in lowercase unless it is a proper name, and don’t use abbreviations where they can be easily expanded to the normal word. Put the simplest 1-3 word definition first. That’s it: [JSONArray]

The results I got were pretty good; here they are in CSV form:

ལན་ཆགས།,noun,retribution; karmic debts; misfortune; adversity; calamity; payment; karmic creditor; fate
དག་པ་འཇིག་རྟེན་པའི་ཡེ་ཤེས,noun,post-meditation knowledge; aftereffect of meditative equipoise; Suddhalaukikajnana
སྨན་བཅོས་སྐོར་བསྐྱོད,noun,mobile medical care; mobile medical attention; mobile health care services
སྨན་བཅོས་ཁང༌,noun,clinic; doctor's consulting room
སྨན་བཅོས་འཕྲོད་བསྟེན་སྡེ་ཁག,noun,health centers; medical services
སྨན་བཅོས་བྱེད,verb,cure; remedy; treat
སྨན་བཅོས་བྱེད་པ,verb,cure; remedy; treat; examine a patient
སྨན་བཅོས་མི་སྣ,noun,medical staff; health professional; healthcare worker

Things separated by semicolons are the array of definitions it returned. Here is my full script calling the OpenAI API:

import OpenAI from 'openai'
import 'dotenv/config'
import fs from 'fs/promises'

const PATH = `import/language/tibetan/definitions.summary.2.json`

const openai = new OpenAI({
  apiKey: process.env.OPEN_AI_API_KEY,
})

async function main() {
  let i = 0
  const records = JSON.parse(
    await fs.readFile(
      `import/language/tibetan/definitions.out.json`,
      `utf-8`,
    ),
  )

  const json = JSON.parse(
    await fs.readFile(
      `import/language/tibetan/definitions.summary.2.json`,
      'utf-8',
    ),
  )

  for (const word in records) {
    if (json[word]) {
      i++
      continue
    }
    let text = await getText(records[word].join('\n'))
    try {
      if (typeof text === 'string') {
        try {
          text = text
            .split(/```json/)[1]
            .split(/```/)[0]
            .trim()
        } catch (e) {}
        console.log(text)
        const output = JSON.parse(text)

        json[word] = output
        console.log(i, word, output)

        await fs.writeFile(PATH, JSON.stringify(json, null, 2))
      } else {
        console.log(i)
      }
    } catch (e) {
      console.error(e)
      console.log(i)
    }

    i++

    async function getText(instructions: string) {
      const completion = await openai.chat.completions.create({
        messages: [
          {
            role: 'system',
            content: 'You are a helpful text summarizer.',
          },
          {
            role: 'user',
            content:
              'Summarize this set of definitions into a set of 1-3 word definitions, removing or merging duplicate or similar definitions where applicable. Omit direct wylie text or otherwise meaningless text. It\'s okay if the definition is longer than 3 words if it can\'t easily be shortened. Send the definitions as a JSON array of strings under the "definitions" key, and send the proposed part-of-speech under the "type" field, formatted in lowercase unless it is a proper name, and don\'t use abbreviations where they can be easily expanded to the normal word. Put the simplest 1-3 word definition first. That\'s it: ' +
              instructions,
          },
        ],
        model: 'gpt-4o',
      })

      return completion.choices[0].message.content
    }
  }
}

main()
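As an aside, the fence-stripping in the script above (splitting on the json code-fence marker) throws whenever the reply isn’t fenced, which is why it’s wrapped in an empty catch. A more forgiving helper (my own naming, just a sketch) might look like:

```javascript
// Sketch: pull JSON out of a model reply that may or may not be wrapped
// in a markdown json code fence. Falls back to parsing the whole reply.
function extractJson(text) {
  const match = text.match(/```json\s*([\s\S]*?)```/)
  const raw = match ? match[1] : text
  return JSON.parse(raw.trim())
}
```

That way a JSON.parse failure only happens when the reply genuinely isn’t JSON, not just because the model omitted the fence.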

How could I accomplish something similar with Transformers.js? (Or a Python equivalent; I assume there will be more Python AI support here on SO.)

  • Is it possible to do with Huggingface libraries somehow?
  • If so, what models/APIs should I use?
  • If you are so kind, could you put together a hello-world script to do this sort of summarization + data cleaning + returning JSON?

I’m not sure if I should be using one or a combo of these (or if it’s not possible):

  • pipelines.TextGenerationPipeline
  • pipelines.SummarizationPipeline
  • Others?

If it’s not possible with Transformers.js or anything else from Hugging Face, is it possible with any free / open-source AI tools? If so, could you briefly describe how to accomplish this (and, if you are so inclined, provide a hello-world example)? If it’s not possible, could you explain why OpenAI’s gpt-4o is able to handle it, but open-source models are not?

In the end, cleaning all ~150k Tibetan definitions with my current OpenAI approach will cost roughly $300-500, and I would like that cost to go to $0 (just my personal development time).

Attempt

Starting out, I have this:

import 'dotenv/config'
import fs from 'fs/promises'

import {
  pipeline,
  AutoTokenizer,
  AutoModelForSeq2SeqLM,
} from '@xenova/transformers'

async function summarizeDefinitions(definitions) {
  // Load the tokenizer
  const tokenizer = await AutoTokenizer.from_pretrained(
    'facebook/bart-large-cnn',
  )

  // Load the model
  const model = await AutoModelForSeq2SeqLM.from_pretrained(
    'facebook/bart-large-cnn',
  )

  const summarizer = await pipeline('summarization', model, tokenizer)

  const cleanedDefinitions = {}

  let i = 0
  for (const term in definitions) {
    const defs = definitions[term]
    const combinedDefs = defs.join('; ')

    // Summarize the combined definitions
    const summary = await summarizer(combinedDefs, {
      max_length: 100, // adjust length based on your requirements
      min_length: 1,
      do_sample: false,
    })

    // Clean up the summary to create 1-3 word definitions
    const cleaned = summary[0].summary_text
      .split('.')
      .map(s => s.trim())
      .filter(s => s.length > 0)
      .map(s =>
        s
          .split(',')
          .map(ss => ss.trim())
          // keep only non-empty phrases of at most three words
          .filter(ss => ss.length > 0 && ss.split(/\s+/).length <= 3),
      )

    cleanedDefinitions[term] = {
      definitions: cleaned.flat(),
      // type: 'noun', // or determine part-of-speech based on your logic
    }

    if (i === 100) {
      break
    }

    i++
  }

  return cleanedDefinitions
}

async function main() {
  const definitions = JSON.parse(
    await fs.readFile(
      `import/language/tibetan/definitions.out.json`,
      `utf-8`,
    ),
  )

  const cleanedDefinitions = await summarizeDefinitions(definitions)
  console.log(cleanedDefinitions)
}

main()

But I am getting:

Error: Could not locate file: "https://huggingface.co/facebook/bart-large-cnn/resolve/main/tokenizer_config.json".

I don’t see a tokenizer_config.json in that model’s repo. Can I somehow download and reference a local one? Or what am I missing?

Also, will this be enough to get the summarization / cleaning / JSON array like OpenAI’s gpt-4o API provided?
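For reference, here is the post-processing I’m ultimately after, in isolation (a pure-function sketch, independent of any model):

```javascript
// Sketch of the cleaning step on its own: break a summary into
// period/semicolon/comma-separated candidates and keep 1-3 word phrases.
function shortDefinitions(summaryText) {
  return summaryText
    .split(/[.;]/)
    .flatMap(part => part.split(','))
    .map(s => s.trim().toLowerCase())
    .filter(s => s.length > 0 && s.split(/\s+/).length <= 3)
}

shortDefinitions('Cure, remedy, treat. Examine a patient')
// → ['cure', 'remedy', 'treat', 'examine a patient']
```

Whether BART-style summaries would feed into something like this cleanly is exactly what I’m unsure about.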

Update

I’m starting to think Transformers.js is not up to par with the Python version, so feel free to explain how to do it in Python instead.
