Background
What pieces would be involved in using Transformers.js to distill/summarize/clean dictionary definitions that are messy and full of “junk”, and return a JSON array of short, summarized definitions?
For example, here are nine Tibetan terms that I sent to the OpenAI model gpt-4o from Node.js:
{
"ལན་ཆགས།": [
"[2759] 1) gnod lan res 'khor gyi rgyun gnas pa/ ... srog gi lan chags/ ... tshe sngon gyi lan chags/ ... 2) (yul) 1/dka' tshegs/ ... 2/chags sgo / ...",
"a particular type of generalized karma which designates a relationship across lives in which the roles of the parties are reversed (a former master becomes the servant to his former servant, etc.). May also refer to the entities that seek retribution; divided into sha 'khon (those who bear malice towards our flesh) and rgyu 'khon (those who bear malice toward our property). Epstein, Dissertation 81, 106. Tucci, Religions 181. bu tsha lan chags snyog pa la byams par byas kyang mgu' ba myi 'ong. Zhi-byed Coll. II 304.7.",
"retribution (answer indebtedness), karmic debts due to past actions, payment",
"1) retribution gnod lan res 'khor gyi rgyun gnas pa; 2) misfortune, adversity, calamity; 3) door",
"{lan chags kyi mgron} (guests who are) karmic creditors",
"karmic debts incurred by. karmic creditor; karmic credit/ retribution/ a debt; disaster caused by karmic retribution/ fate; misfortune, adversity, calamity, retribution"
],
"དག་པ་འཇིག་རྟེན་པའི་ཡེ་ཤེས": [
"A term used by ZAntipa in his Hevajra sAdhana, for the ye shes in post-meditation phase, rjes thob (aftereffect of the AvikalpajñAna in meditative equipoise, mnyam gzhag). Skt. zuddhalaukikajñAna."
],
"སྨན་བཅོས་སྐོར་བསྐྱོད": [
"mobile med. care"
],
"སྨན་བཅོས་ཁང༌": [
"clinic",
"*, dr.s consulting room"
],
"སྨན་བཅོས་འཕྲོད་བསྟེན": [
"public health, health care"
],
"སྨན་བཅོས་འཕྲོད་བསྟེན་སྡེ་ཁག": [
"health clinics"
],
"སྨན་བཅོས་བྱེད": [
"cure, remedy, treat"
],
"སྨན་བཅོས་བྱེད་པ": [
"cure, remedy, treat",
"*, examine a patient",
"to *"
],
"སྨན་བཅོས་མི་སྣ": [
"med. personnel"
]
}
Notice how each array item is messy, uses abbreviations, or has non-English “junk” in it like {lan chags kyi mgron} (guests who are) karmic creditors does. I asked OpenAI to simplify and distill the definitions into lowercased 1-3 word terms/phrases (where possible; otherwise the definitions can be longer). I didn’t pass the Tibetan term itself, so the model wouldn’t try to define it on its own. Here is basically what I sent to the OpenAI API:
const completion = await openai.chat.completions.create({
messages: [
{
role: 'system',
content: 'You are a helpful text summarizer.',
},
{
role: 'user',
content:
'Summarize this set of definitions into a set of 1-3 word definitions, removing or merging duplicate or similar definitions where applicable. Omit direct wylie text or otherwise meaningless text. It\'s okay if the definition is longer than 3 words if it can\'t easily be shortened. Send the definitions as a JSON array of strings under the "definitions" key, and send the proposed part-of-speech under the "type" field, formatted in lowercase unless it is a proper name, and don\'t use abbreviations where they can be easily expanded to the normal word. Put the simplest 1-3 word definition first. That\'s it: ' +
instructions,
},
],
model: 'gpt-4o',
})
Here, instructions was the raw JSON array of definitions from above for a single term, like:
const instructions = [
"cure, remedy, treat",
"*, examine a patient",
"to *"
]
I made one such call for each term/definitions pair. Here is the prompt in isolation, for easier reading:
Summarize this set of definitions into a set of 1-3 word definitions, removing or merging duplicate or similar definitions where applicable. Omit direct wylie text or otherwise meaningless text. It’s okay if the definition is longer than 3 words if it can’t easily be shortened. Send the definitions as a JSON array of strings under the “definitions” key, and send the proposed part-of-speech under the “type” field, formatted in lowercase unless it is a proper name, and don’t use abbreviations where they can be easily expanded to the normal word. Put the simplest 1-3 word definition first. That’s it: [JSONArray]
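(Aside: since I’m asking for JSON back, I believe gpt-4o also supports OpenAI’s JSON mode via the response_format option, which would avoid the markdown-fence stripping my script does below; a sketch, where prompt stands in for the user message above:)

const completion = await openai.chat.completions.create({
  model: 'gpt-4o',
  // JSON mode: the prompt must mention "JSON" for the API to accept this.
  response_format: { type: 'json_object' },
  messages: [
    { role: 'system', content: 'You are a helpful text summarizer.' },
    { role: 'user', content: prompt },
  ],
})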
The results I got were pretty good; here they are in CSV form:
ལན་ཆགས།,noun,retribution; karmic debts; misfortune; adversity; calamity; payment; karmic creditor; fate
དག་པ་འཇིག་རྟེན་པའི་ཡེ་ཤེས,noun,post-meditation knowledge; aftereffect of meditative equipoise; Suddhalaukikajnana
སྨན་བཅོས་སྐོར་བསྐྱོད,noun,mobile medical care; mobile medical attention; mobile health care services
སྨན་བཅོས་ཁང༌,noun,clinic; doctor's consulting room
སྨན་བཅོས་འཕྲོད་བསྟེན་སྡེ་ཁག,noun,health centers; medical services
སྨན་བཅོས་བྱེད,verb,cure; remedy; treat
སྨན་བཅོས་བྱེད་པ,verb,cure; remedy; treat; examine a patient
སྨན་བཅོས་མི་སྣ,noun,medical staff; health professional; healthcare worker
Things separated by semicolons are the array of definitions it returned for each term.
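For completeness, I flattened the summary JSON into that CSV roughly like this (a sketch with naive CSV quoting, which happens to be fine here because the flattened fields contain no commas; the output path is arbitrary):

import fs from 'fs/promises'

const summary = JSON.parse(
  await fs.readFile(
    'import/language/tibetan/definitions.summary.2.json',
    'utf-8',
  ),
)

// One row per term: term, part-of-speech, definitions joined by "; ".
const rows = Object.entries(summary).map(([term, { type, definitions }]) =>
  [term, type, definitions.join('; ')].join(','),
)

await fs.writeFile('definitions.summary.csv', rows.join('\n'))

Here is my full script calling the OpenAI API: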
import OpenAI from 'openai'
import 'dotenv/config'
import fs from 'fs/promises'
const PATH = `import/language/tibetan/definitions.summary.2.json`
const openai = new OpenAI({
apiKey: process.env.OPEN_AI_API_KEY,
})
async function main() {
let i = 0
const records = JSON.parse(
await fs.readFile(
`import/language/tibetan/definitions.out.json`,
`utf-8`,
),
)
const json = JSON.parse(
await fs.readFile(
`import/language/tibetan/definitions.summary.2.json`,
'utf-8',
),
)
for (const word in records) {
if (json[word]) {
i++
continue
}
let text = await getText(records[word].join('\n'))
try {
if (typeof text === 'string') {
try {
text = text
.split(/```json/)[1]
.split(/```/)[0]
.trim()
} catch (e) {}
console.log(text)
const output = JSON.parse(text)
json[word] = output
console.log(i, word, output)
await fs.writeFile(PATH, JSON.stringify(json, null, 2))
} else {
console.log(i)
}
} catch (e) {
console.error(e)
console.log(i)
}
i++
async function getText(instructions: string) {
const completion = await openai.chat.completions.create({
messages: [
{
role: 'system',
content: 'You are a helpful text summarizer.',
},
{
role: 'user',
content:
'Summarize this set of definitions into a set of 1-3 word definitions, removing or merging duplicate or similar definitions where applicable. Omit direct wylie text or otherwise meaningless text. It\'s okay if the definition is longer than 3 words if it can\'t easily be shortened. Send the definitions as a JSON array of strings under the "definitions" key, and send the proposed part-of-speech under the "type" field, formatted in lowercase unless it is a proper name, and don\'t use abbreviations where they can be easily expanded to the normal word. Put the simplest 1-3 word definition first. That\'s it: ' +
instructions,
},
],
model: 'gpt-4o',
})
return completion.choices[0].message.content
}
}
}
main()
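For reference, once the ```json fences are stripped, the text that gets JSON.parsed looks like this (values taken from the སྨན་བཅོས་བྱེད་པ row above):

{
  "definitions": ["cure", "remedy", "treat", "examine a patient"],
  "type": "verb"
}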
How could I accomplish something similar with Transformers.js? (Or a Python equivalent; I assume there will be more Python AI support here on SO.)
- Is it possible to do with Hugging Face libraries somehow?
- If so, what models/APIs should I use?
- If you are so kind, could you put together a hello-world script to do this sort of summarization + data cleaning + returning JSON?
I’m not sure if I should be using one of these, a combination, or whether it’s possible at all (see the usage sketch after this list):
- pipelines.TextGenerationPipeline
- pipelines.SummarizationPipeline
- Others?
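For context, basic summarization usage from the Transformers.js README looks like this (the model name is one from their examples; I haven’t verified it against my data):

import { pipeline } from '@xenova/transformers'

// The first call downloads and caches an ONNX-converted model from the Hub.
const summarizer = await pipeline('summarization', 'Xenova/distilbart-cnn-6-6')

const out = await summarizer('cure, remedy, treat; examine a patient; to cure', {
  max_new_tokens: 100,
})
console.log(out) // [{ summary_text: '...' }]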
If it’s not possible with Transformers.js or anything Hugging Face, is it possible with any free / open source AI tools? If so, could you briefly describe how to accomplish this (and, if you are so inclined, provide a hello-world example)? If it’s not possible, could you explain why OpenAI’s gpt-4o can handle it but open source models cannot?
In the end, cleaning ~150k Tibetan definitions the way my OpenAI approach does will cost ~$300-500, and I would like that cost to go to $0 (just my personal development time).
Attempt
Starting out, I have this:
import 'dotenv/config'
import fs from 'fs/promises'
import {
pipeline,
AutoTokenizer,
AutoModelForSeq2SeqLM,
} from '@xenova/transformers'
async function summarizeDefinitions(definitions) {
// Load the tokenizer
const tokenizer = await AutoTokenizer.from_pretrained(
'facebook/bart-large-cnn',
)
// Load the model
const model = await AutoModelForSeq2SeqLM.from_pretrained(
'facebook/bart-large-cnn',
)
const summarizer = await pipeline('summarization', model, tokenizer)
const cleanedDefinitions = {}
let i = 0
for (const term in definitions) {
const defs = definitions[term]
const combinedDefs = defs.join('; ')
// Summarize the combined definitions
const summary = await summarizer(combinedDefs, {
max_length: 100, // adjust length based on your requirements
min_length: 1,
do_sample: false,
})
// Clean up the summary to create 1-3 word definitions
const cleaned = summary[0].summary_text
.split('.')
.map(s => s.trim())
.filter(s => s.length > 0)
.map(s =>
s
.split(',')
.map(ss => ss.trim())
// keep phrases of at most 3 words (.length alone would count characters)
.filter(ss => ss.split(/\s+/).length <= 3),
)
cleanedDefinitions[term] = {
definitions: cleaned.flat(),
// type: 'noun', // or determine part-of-speech based on your logic
}
if (i === 100) {
break
}
i++
}
return cleanedDefinitions
}
async function main() {
const definitions = JSON.parse(
await fs.readFile(
`import/language/tibetan/definitions.out.json`,
`utf-8`,
),
)
const cleanedDefinitions = await summarizeDefinitions(definitions)
console.log(cleanedDefinitions)
}
main()
But I am getting:
Error: Could not locate file: "https://huggingface.co/facebook/bart-large-cnn/resolve/main/tokenizer_config.json".
I don’t see that tokenizer_config.json in the model’s repo on the Hub. Can I somehow download and reference a local one? Or what am I missing?
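From the @xenova/transformers README, it looks like local files can be used via the library’s env settings; a sketch of what I understand that to look like (the ./models path is my own guess, and I gather the folder would need ONNX-converted weights plus the tokenizer files, since Transformers.js runs ONNX rather than PyTorch checkpoints):

import { env, pipeline } from '@xenova/transformers'

// Resolve model ids against ./models/<model-id>/ instead of the Hugging Face Hub.
env.localModelPath = './models'
env.allowRemoteModels = false // fail fast instead of falling back to the network

const summarizer = await pipeline('summarization', 'facebook/bart-large-cnn')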
Also, will this be enough to get the summarization / cleaning / JSON array that OpenAI’s gpt-4o API provided?
Update
I’m starting to think Transformers.js is not up to par with the Python version, so maybe you can explain how to do it in Python instead.