More about NAMER, the Bitext Smart Entity Technology
January 14, 2025
A dinobaby product! We used some smart software to fix up the grammar. The system mostly worked. Surprised? We were.
We spotted more information about the Madrid, Spain based Bitext technology firm. The company posted “Integrating Bitext NAMER with LLMs” in late December 2024. At about the same time, government authorities arrested a person known as “Broken Tooth.” In 2021, an alert for this individual was posted. His “real” name is Wan Kuok-koi, and he has been in an out of trouble for a number of years. He is alleged to be part of a criminal organization and active in a number of illegal behaviors; for example, money laundering and human trafficking. The online service Irrawady reported that Broken Tooth is “the face of Chinese investment in Myanmar.”
Broken Tooth (né Wan Kuok-koi, born in Macau) is one example of the importance of identifying entity names and relating them to individuals and the organizations with which they are affiliated. A failure to identify entities correctly can mean the difference between resolving an alleged criminal activity and a get-out-of-jail-free card. This is the specific problem that Bitext’s NAMER system addresses. Bitext says that large language models are designed for for text generation, not entity classification. Furthermore, LLMs pose some cost and computational demands which can pose problems to some organizations working within tight budget constraints. Plus, processing certain data in a cloud increases privacy and security risks.
Bitext’s solution provides an alternative way to achieve fine-grained entity identification, extraction, and tagging. Bitext’s solution combines classical natural language processing solutions solutions with large language models. Classical NLP tools, often deployable locally, complement LLMs to enhance NER performance.
NAMER excels at:
- Identifying generic names and classifying them as people, places, or organizations.
- Resolving aliases and pseudonyms.
- Differentiating similar names tied to unrelated entities.
Bitext supports over 20 languages, with additional options available on request. How does the hybrid approach function? There are two effective integration methods for Bitext NAMER with LLMs like GPT or Llama are. The first is pre-processing input. This means that entities are annotated before passing the text to the LLM, ideal for connecting entities to knowledge graphs in large systems. The second is to configure the LLM to call NAMER dynamically.
The output of the Bitext system can generate tagged entity lists and metadata for content libraries or dictionary applications. The NAMER output can integrate directly into existing controlled vocabularies, indexes, or knowledge graphs. Also, NAMER makes it possible to maintain separate files of entities for on-demand access by analysts, investigators, or other text analytics software.
By grouping name variants, Bitext NAMER streamlines search queries, enhancing document retrieval and linking entities to knowledge graphs. This creates a tailored “semantic layer” that enriches organizational systems with precision and efficiency.
For more information about the unique NAMER system, contact Bitext via the firm’s Web site at www.bitext.com.
Stephen E Arnold, January 14, 2025
FOGINT: A Shocking Assertion about Israeli Intelligence Before the October 2023 Attack
January 13, 2025
One of my colleagues alerted me to a new story in the Jerusalem Post. The article is “IDF Could’ve Stopped Oct. 7 by Monitoring Hamas’s Telegram, Researchers Say.” The title makes clear that this is an “after action” analysis. Everyone knows that thinking about the whys and wherefores right of bang is a safe exercise. Nevertheless, let’s look at what the Jerusalem Post reported on January 5, 2025.
First, this statement:
“These [Telegram] channels were neither secret nor hidden — they were open and accessible to all.” — Lt.-Col. (res.) Jonathan Dahoah-Halevi
Telegram puts some “silent” barriers to prevent some third parties from downloading in real time active discussions. I know of one Israeli cyber security firm which asserts that it monitors Telegram public channel messages. (I won’t ask the question, “Why didn’t analysts at that firm raise an alarm or contact their former Israeli government employers with that information? Those are questions I will sidestep.)
Second, the article reports:
These channels [public Telegram channels like Military Tactics] were neither secret nor hidden — they were open and accessible to all. The “Military Tactics” Telegram channel even shared professional content showcasing the organization’s level of preparedness and operational capabilities. During the critical hours before the attack, beginning at 12:20 a.m. on October 7, the channel posted a series of detailed messages that should have raised red flags, including: “We say to the Zionist enemy, [the operation] coming your way has never been experienced by anyone,” “There are many, many, many surprises,” “We swear by Allah, we will humiliate you and utterly destroy you,” and “The pure rifles are loaded, and your heads are the target.”
Third, I circled this statement:
However, Dahoah-Halevi further asserted that the warning signs appeared much earlier. As early as September 17, a message from the Al-Qassam Brigades claimed, “Expect a major security event soon.” The following day, on September 18, a direct threat was issued to residents of the Gaza border communities, stating, “Before it’s too late, flee and leave […] nothing will help you except escape.”
The attack did occur, and it had terrible consequences for the young people killed and wounded and for the Israeli cyber security industry, which some believe is one of the best in the world. The attack suggested that marketing rather than effectiveness created an impression at odds with reality.
What are the lessons one can take from this report? The FOGINT team will leave that to you to answer.
Stephen E Arnold, January 13, 2025
Juicing Up RAG: The RAG Bop Bop
December 26, 2024
Can improved information retrieval techniques lead to more relevant data for AI models? One startup is using a pair of existing technologies to attempt just that. MarkTechPost invites us to “Meet CircleMind: An AI Startup that is Transforming Retrieval Augmented Generation with Knowledge Graphs and PageRank.” Writer Shobha Kakkar begins by defining Retrieval Augmented Generation (RAG). For those unfamiliar, it basically combines information retrieval with language generation. Traditionally, these models use either keyword searches or dense vector embeddings. This means a lot of irrelevant and unauthoritative data get raked in with the juicy bits. The write-up explains how this new method refines the process:
“CircleMind’s approach revolves around two key technologies: Knowledge Graphs and the PageRank Algorithm. Knowledge graphs are structured networks of interconnected entities—think people, places, organizations—designed to represent the relationships between various concepts. They help machines not just identify words but understand their connections, thereby elevating how context is both interpreted and applied during the generation of responses. This richer representation of relationships helps CircleMind retrieve data that is more nuanced and contextually accurate. However, understanding relationships is only part of the solution. CircleMind also leverages the PageRank algorithm, a technique developed by Google’s founders in the late 1990s that measures the importance of nodes within a graph based on the quantity and quality of incoming links. Applied to a knowledge graph, PageRank can prioritize nodes that are more authoritative and well-connected. In CircleMind’s context, this ensures that the retrieved information is not only relevant but also carries a measure of authority and trustworthiness. By combining these two techniques, CircleMind enhances both the quality and reliability of the information retrieved, providing more contextually appropriate data for LLMs to generate responses.”
CircleMind notes its approach is still in its early stages, and expects it to take some time to iron out all the kinks. Scaling it up will require clearing hurdles of speed and computational costs. Meanwhile, a few early users are getting a taste of the beta version now. Based in San Francisco, the young startup was launched in 2024.
Cynthia Murrell, December 26, 2024
Bitext NAMER: Simplifying Tracking of Translated Organizational Names
December 11, 2024
This blog post is the work of an authentic dinobaby. No smart software was used.
We wrote a short item about tracking Chinese names translated to English, French, or Spanish with widely varying spellings. Now Bitext’s entity extraction system can perform the same disambiguation for companies and non-governmental entities. Analysts may be looking for a casino which operates with a Chinese name. That gambling facility creates marketing collateral or gets news coverage which uses a different name or a spelling which is different from the operation’s actual name. As a result, missing a news item related to that operation is an on-going problem for some professionals.
Bitext has revealed that its proprietary technology can perform the same tagging and extraction process for organizational names in more than two dozen languages. In “Bitext NAMER Cracks Named Entity Recognition,” the company reports:
… issues arise with organizational names, such as “Sun City” (a place and enterprise) or aliases like “Yati New City” for “Shwe Koko”; and, in general, with any language that is written in non-Roman alphabet and needs transliteration. In fact, these issues affect to all languages that do not use Roman alphabet including Hindi, Malayalam or Vietnamese, since transliteration is not a one-to-one function but a one-to-many and, as a result, it generates ambiguity the hinders the work of analysts. With real-time data streaming into government software, resolving ambiguities in entity identification is crucial, particularly for investigations into activities like money laundering.
Unlike some other approaches — for instance, smart large language models — the Bitext NAMER technology:
- Identifies correctly generic names
- Performs type assignment; specifically, person, place, time, and organization
- Tags AKA (also known as) and pseudonyms
- Distinguishes simile names linked to unelated entitles; for example, Levo Chan.
The company says:
Our unique method enables accurate, multilingual entity detection and normalization for a variety of applications.
Bitext’s technology is used by three of the top five US companies listed on NASDAQ. The firm’s headquarters are in Madrid, Spain. For more information, contact the company via its Web site, www.bitext.com.
Stephen E Arnold, December 11, 2024
Entity Extraction: Not As Simple As Some Vendors Say
November 19, 2024
No smart software. Just a dumb dinobaby. Oh, the art? Yeah, MidJourney.
Most of the systems incorporating entity extraction have been trained to recognize the names of simple entities and mostly based on the use of capitalization. An “entity” can be a person’s name, the name of an organization, or a location like Niagara Falls, near Buffalo, New York. The river “Niagara” when bound to “Falls” means a geologic feature. The “Buffalo” is not a Bubalina; it is a delightful city with even more pleasing weather.
The same entity extraction process has to work for specialized software used by law enforcement, intelligence agencies, and legal professionals. Compared to entity extraction for consumer-facing applications like Google’s Web search or Apple Maps, the specialized software vendors have to contend with:
- Gang slang in English and other languages; for example, “bumble bee.” This is not an insect; it is a nickname for the Latin Kings.
- Organizations operating in Lao PDR and converted to English words like Zhao Wei’s Kings Romans Casino. Mr. Wei has been allegedly involved in gambling activities in a poorly-regulated region in the Golden Triangle.
- Individuals who use aliases like maestrolive, james44123, or ahmed2004. There are either “real” people behind the handles or they are sock puppets (fake identities).
Why do these variations create a challenge? In order to locate a business, the content processing system has to identify the entity the user seeks. For an investigator, chopping through a thicket of language and idiosyncratic personas is the difference between making progress or hitting a dead end. Automated entity extraction systems can work using smart software, carefully-crafted and constantly updated controlled vocabulary list, or a hybrid system.
Automated entity extraction systems can work using smart software, carefully-crafted and constantly updated controlled vocabulary list, or a hybrid system.
Let’s take an example which confronts a person looking for information about the Ku Group. This is a financial services firm responsible for the Kucoin. The Ku Group is interesting because it has been found guilty in the US for certain financial activities in the State of New York and by the US Securities & Exchange Commission.
Another Reminder about the Importance of File Conversions That Work
October 18, 2024
Salesforce has revamped its business plan and is heavily investing in AI-related technology. The company is also acquiring AI companies located in Israel. CTech has the lowdown on Salesforce’s latest acquisition related to AI file conversion: “Salesforce Acquiring Zoomin For $450 Million.”
Zoomin is an Israeli data management provider for unstructured at and Salesforce purchased it for $450 million. This is way more than what Zoomin was appraised at in 2021, so investors are happy. Earlier in September, Salesforce also bought another Israeli company Own. Buying Zoomin is part of Salesforce’s long term plan to add AI into its business practices.
Since AI need data libraries to train and companies also possess a lot of unstructured data that needs organizing, Zoomin is a wise investment for Salesforce. Zoomin has a lot to offer Salesforce:
“Following the acquisition, Zoomin’s technology will be integrated into Salesforce’s Agentforce platform, allowing customers to easily connect their existing organizational data and utilize it within AI-based customer experiences. In the initial phase, Zoomin’s solution will be integrated into Salesforce’s Data Cloud and Service Cloud, with plans to expand its use across all Salesforce solutions in the future.”
Salesforce is taking steps that other businesses will eventually follow. Will Salesforce start selling the converted data to train AI? Also will Salesforce become a new Big Tech giant?
Whitney Grace, October 18, 2024
Google Synthetic Content Scaffolding
September 3, 2024
This essay is the work of a dumb dinobaby. No smart software required.
Google posted what I think is an important technical paper on the arXiv service. The write up is “Towards Realistic Synthetic User-Generated Content: A Scaffolding Approach to Generating Online Discussions.” The paper has six authors and presumably has the grade of “A”, a mark not award to the stochastic parrot write up about Google-type smart software.
For several years, Google has been exploring ways to make software that would produce content suitable for different use cases. One of these has been an effort to use transformer and other technology to produce synthetic data. The idea is that a set of real data is mimicked by AI so that “real” data does not have to be acquired, intercepted, captured, or scraped from systems in the real-time, highly litigious real world. I am not going to slog through the history of smart software and the research and application of synthetic data. If you are curious, check out Snorkel and the work of the Stanford Artificial Intelligence Lab or SAIL.
The paper I referenced above illustrates that Google is “close” to having a system which can generate allegedly realistic and good enough outputs to simulate the interaction of actual human beings in an online discussion group. I urge you to read the paper, not just the abstract.
Consider this diagram (which I know is impossible to read in this blog format so you will need the PDF of the cited write up):
The important point is that the process for creating synthetic “human” online discussions requires a series of steps. Notice that the final step is “fine tuned.” Why is this important? Most smart software is “tuned” or “calibrated” so that the signals generated by a non-synthetic content set are made to be “close enough” to the synthetic content set. In simpler terms, smart software is steered or shaped to match signals. When the match is “good enough,” the smart software is good enough to be deployed either for a test, a research project, or some use case.
Most of the AI write ups employ steering, directing, massaging, or weaponizing (yes, weaponizing) outputs to achieve an objective. Many jobs will be replaced or supplemented with AI. But the jobs for specialists who can curve fit smart software components to produce “good enough” content to achieve a goal or objective will remain in demand for the foreseeable future.
The paper states in its conclusion:
While these results are promising, this work represents an initial attempt at synthetic discussion thread generation, and there remain numerous avenues for future research. This includes potentially identifying other ways to explicitly encode thread structure, which proved particularly valuable in our results, on top of determining optimal approaches for designing prompts and both the number and type of examples used.
The write up is a preliminary report. It takes months to get data and approvals for this type of public document. How far has Google come between the idea to write up results and this document becoming available on August 15, 2024? My hunch is that Google has come a long way.
What’s the use case for this project? I will let younger, more optimistic minds answer this question. I am a dinobaby, and I have been around long enough to know a potent tool when I encounter one.
Stephen E Arnold, September 3, 2024
Suddenly: Worrying about Content Preservation
August 19, 2024
This essay is the work of a dumb dinobaby. No smart software required.
Digital preservation may be becoming a hot topic for those who rarely think about finding today’s information tomorrow or even later today. Two write ups provide some hooks on which thoughts about finding information could be hung.
The young scholar faces some interesting knowledge hurdles. Traditional institutions are not much help. Thanks, MSFT Copilot. Is Outlook still crashing?
The first concerns PDFs. The essay and how to is “Classifying All of the PDFs on the Internet.” A happy quack to the individual who pursued this project, presented findings, and provided links to the data sets. Several items struck me as important in this project research report:
- Tracking down PDF files on the “open” Web is not something that can be done with a general Web search engine. The takeaway for me is that PDFs, like PowerPoint files, are either skipped or not crawled. The author had to resort to other, programmatic methods to find these file types. If an item cannot be “found,” it ceases to exist. How about that for an assertion, archivists?
- The distribution of document “source” across the author’s prediction classes splits out mathematics, engineering, science, and technology. Considering these separate categories as one makes clear that the PDF universe is about 25 percent of the content pool. Since technology is a big deal for innovators and money types, losing or not being able to access these data suggest a knowledge hurdle today and tomorrow in my opinion. An entity capturing these PDFs and making them available might have a knowledge advantage.
- Entities like national libraries and individualized efforts like the Internet Archive are not capturing the full sweep of PDFs based on my experience.
My reading of the essay made me recognize that access to content on the open Web is perceived to be easy and comprehensive. It is not. Your mileage may vary, of course, but this write up illustrates a large, multi-terabyte problem.
The second story about knowledge comes from the Epstein-enthralled institution’s magazine. This article is “The Race to Save Our Online Lives from a Digital Dark Age.” To make the urgency of the issue more compelling and better for the Google crawling and indexing system, this subtitle adds some lemon zest to the dish of doom:
We’re making more data than ever. What can—and should—we save for future generations? And will they be able to understand it?
The write up states:
For many archivists, alarm bells are ringing. Across the world, they are scraping up defunct websites or at-risk data collections to save as much of our digital lives as possible. Others are working on ways to store that data in formats that will last hundreds, perhaps even thousands, of years.
The article notes:
Human knowledge doesn’t always disappear with a dramatic flourish like GeoCities; sometimes it is erased gradually. You don’t know something’s gone until you go back to check it. One example of this is “link rot,” where hyperlinks on the web no longer direct you to the right target, leaving you with broken pages and dead ends. A Pew Research Center study from May 2024 found that 23% of web pages that were around in 2013 are no longer accessible.
Well, the MIT story has a fix:
One way to mitigate this problem is to transfer important data to the latest medium on a regular basis, before the programs required to read it are lost forever. At the Internet Archive and other libraries, the way information is stored is refreshed every few years. But for data that is not being actively looked after, it may be only a few years before the hardware required to access it is no longer available. Think about once ubiquitous storage mediums like Zip drives or CompactFlash.
To recap, one individual made clear that PDF content is a slippery fish. The other write up says the digital content itself across the open Web is a lot of slippery fish.
The fix remains elusive. The hurdles are money, copyright litigation, and technical constraints like storage and indexing resources.
Net net: If you want to preserve an item of information, print it out on some of the fancy Japanese archival paper. An outfit can say it archives, but in reality the information on the shelves is a tiny fraction of what’s “out there”.
Stephen E Arnold, August 19, 2024
Sakana: Can Its Smart Software Replace Scientists and Grant Writers?
August 13, 2024
This essay is the work of a dumb dinobaby. No smart software required.
A couple of years ago, merging large language models seemed like a logical way to “level up” in the artificial intelligence game. The notion of intelligence aggregation implied that if competitor A was dumb enough to release models and other digital goodies as open source, an outfit in the proprietary software business could squish the other outfits’ LLMs into the proprietary system. The costs of building one’s own super-model could be reduced to some extent.
Merging is a very popular way to whip up pharmaceuticals. Take a little of this and a little of that and bingo one has a new drug to flog through the approval process. Another example is taking five top consultants from Blue Chip Company I and five top consultants from Blue Chip Company II and creating a smarter, higher knowledge value Blue Chip Company III. Easy.
A couple of Xooglers (former Google wizards) are promoting a firm called Sakana.ai. The purpose of the firm is to allow smart software (based on merging multiple large language models and proprietary systems and methods) to conduct and write up research (I am reluctant to use the word “original”, but I am a skeptical dinobaby.) The company says:
One of the grand challenges of artificial intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used to aid human scientists, e.g. for brainstorming ideas or writing code, they still require extensive manual supervision or are heavily constrained to a specific task. Today, we’re excited to introduce The AI Scientist, the first comprehensive system for fully automatic scientific discovery, enabling Foundation Models such as Large Language Models (LLMs) to perform research independently. In collaboration with the Foerster Lab for AI Research at the University of Oxford and Jeff Clune and Cong Lu at the University of British Columbia, we’re excited to release our new paper, The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery.
Sakana does not want to merge the “big” models. Its approach for robot generated research is to combine specialized models. Examples which came to my mind were drug discovery and providing “good enough” blue chip consulting outputs. These are both expensive businesses to operate. Imagine the payoff if the Sakana approach delivers high value results. Instead of merging big, the company wants to merge small; that is, more specialized models and data. The idea is that specialized data may sidestep some of the interesting issues facing Google, Meta, and OpenAI among others.
Sakana’s Web site provides this schematic to help the visitor get a sense of the mechanics of the smart software. The diagram is Sakana’s, not mine.
I don’t want to let science fiction get in the way of what today’s AI systems can do in a reliable manner. I want to make some observations about smart software making discoveries and writing useful original research papers or for BearBlog.dev.
- The company’s Web site includes a link to a paper written by the smart software. With a sample of one, I cannot see much difference between it and the baloney cranked out by the Harvard medical group or Stanford’s former president. If software did the work, it is a good deep fake.
- Should the software be able to assemble known items of information into something “novel,” the company has hit a home run in the AI ballgame. I am not a betting dinobaby. You make your own guess about the firm’s likelihood of success.
- If the software works to some degree, quite a few outfits looking for a way to replace people with a Sakana licensing fee will sign up. Will these outfits renew? I have no idea. But “good enough” may be just what these companies want.
Net net: The Sakana.ai Web site includes a how it works, more papers about items “discovered” by the software, and a couple of engineers-do-philosophy-and-ethics write ups. A “full scientific report” is available at https://arxiv.org/abs/2408.06292. I wonder if the software invented itself, wrote the documents, and did the marketing which caught my attention. Maybe?
Stephen E Arnold, August 13, 2024
Students, Rejoice. AI Text Is Tough to Detect
July 19, 2024
While the robot apocalypse is still a long way in the future, AI algorithms are already changing the dynamics of work, school, and the arts. It’s an unfortunate consequence of advancing technology and a line in the sand needs to be drawn and upheld about appropriate uses of AI. A real world example was published in the Plos One Journal: “A Real-World Test Of Artificial Intelligence Infiltration Of A University Examinations System: A ‘Turing Test’ Case Study.”
Students are always searching for ways to cheat the education system. ChatGPT and other generative text AI algorithms are the ultimate cheating tool. School and universities don’t have systems in place to verify that student work isn’t artificially generated. Other than students learning essential knowledge and practicing core skills, the ways students are assessed is threatened.
The creators of the study researched a question we’ve all been asking: Can AI pass as a real human student? While the younger sects aren’t the sharpest pencils, it’s still hard to replicate human behavior or is it?
“We report a rigorous, blind study in which we injected 100% AI written submissions into the examinations system in five undergraduate modules, across all years of study, for a BSc degree in Psychology at a reputable UK university. We found that 94% of our AI submissions were undetected. The grades awarded to our AI submissions were on average half a grade boundary higher than that achieved by real students. Across modules there was an 83.4% chance that the AI submissions on a module would outperform a random selection of the same number of real student submissions.”
The AI exams and assignments received better grades than those written by real humans. Computers have consistently outperformed humans in what they’re programmed to do: calculations, play chess, and do repetitive tasks. Student work, such as writing essays, taking exams, and unfortunate busy work, is repetitive and monotonous. It’s easily replicated by AI and it’s not surprising the algorithms perform better. It’s what they’re programmed to do.
The problem isn’t that AI exist. The problem is that there aren’t processes in place to verify student work and humans will cave to temptation via the easy route.
Whitney Grace, July 19, 2024