Entity Extraction: Not As Simple As Some Vendors Say

November 19, 2024

No smart software. Just a dumb dinobaby. Oh, the art? Yeah, MidJourney.

Most systems incorporating entity extraction have been trained to recognize the names of simple entities, relying mostly on capitalization. An “entity” can be a person’s name, the name of an organization, or a location like Niagara Falls, near Buffalo, New York. The river name “Niagara,” when bound to “Falls,” means a geologic feature. The “Buffalo” is not a Bubalina; it is a delightful city with even more pleasing weather.

The same entity extraction process has to work for specialized software used by law enforcement, intelligence agencies, and legal professionals. Compared to entity extraction for consumer-facing applications like Google’s Web search or Apple Maps, the specialized software vendors have to contend with:

  • Gang slang in English and other languages; for example, “bumble bee.” This is not an insect; it is a nickname for the Latin Kings.
  • Organizations operating in Lao PDR whose names have been converted to English words, like Zhao Wei’s Kings Romans Casino. Zhao Wei has allegedly been involved in gambling activities in a poorly regulated region of the Golden Triangle.
  • Individuals who use aliases like maestrolive, james44123, or ahmed2004. Either “real” people are behind the handles, or the handles are sock puppets (fake identities).

Why do these variations create a challenge? In order to locate a business, the content processing system has to identify the entity the user seeks. For an investigator, chopping through a thicket of language and idiosyncratic personas is the difference between making progress and hitting a dead end. Automated entity extraction systems can work using smart software, carefully crafted and constantly updated controlled vocabulary lists, or a hybrid system.
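A minimal sketch of the hybrid idea may help. It assumes a tiny, hand-built controlled vocabulary plus two simple patterns; the vocabulary entries, the handle pattern, and the function name are hypothetical illustrations, not any vendor's method.

```python
import re

# Hypothetical controlled vocabulary: lowercase surface form -> (canonical entity, type).
# A real list would be far larger and constantly updated.
CONTROLLED_VOCABULARY = {
    "bumble bee": ("Latin Kings", "ORGANIZATION"),
    "kings romans casino": ("Kings Romans Casino", "ORGANIZATION"),
    "niagara falls": ("Niagara Falls", "LOCATION"),
}

# Assumed handle shape: lowercase letters followed by two or more digits (james44123).
HANDLE_PATTERN = re.compile(r"\b[a-z]+\d{2,}\b")

# Naive capitalization heuristic: runs of two or more capitalized words.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def extract_entities(text):
    """Return (entity, type, source) tuples from vocabulary, pattern, and capitalization passes."""
    found = []
    lowered = text.lower()
    # Pass 1: controlled vocabulary catches slang that capitalization cannot.
    for term, (canonical, kind) in CONTROLLED_VOCABULARY.items():
        if term in lowered:
            found.append((canonical, kind, "vocabulary"))
    # Pass 2: pattern match for alias-style handles.
    for match in HANDLE_PATTERN.finditer(text):
        found.append((match.group(), "HANDLE", "pattern"))
    # Pass 3: capitalization fallback for names the vocabulary does not know.
    known = {canonical for canonical, _, _ in found}
    for match in CAPITALIZED_RUN.finditer(text):
        if match.group() not in known:
            found.append((match.group(), "UNKNOWN", "capitalization"))
    return found


result = extract_entities("The bumble bee crew met james44123 near Niagara Falls.")
```

The capitalization pass alone would miss “bumble bee” entirely, and the assumed handle pattern would miss a digit-free alias like maestrolive, which is why the vocabulary has to be maintained by people who know the domain.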

Let’s take an example which confronts a person looking for information about the Ku Group. This is a financial services firm responsible for Kucoin. The Ku Group is interesting because it has been found guilty in the US for certain financial activities in the State of New York and by the US Securities & Exchange Commission.

Read more

Need a Specialized String Matcher for Tracking Entities?

January 21, 2020

Specialized services are available to track strings; for example, the name of an entity (person, place, event), an email handle, or any other string. These services may not be offered to the public. A potential customer has to locate a low profile operation, go through a weird series of interactions, and then work quite hard to get a demo of the super stealthy technology. Once the “I am a legitimate customer” drill is complete, the individual wanting to use the stealthy service has to pay hundreds, thousands, or even more per month. In our DarkCyber video program we have profiled some of these businesses.

No more.

The technology, and possibly a massive expansion of monitoring, is poised to make tools once reserved for government agencies available to anyone with an Internet connection and a credit card. Brandchirps.com provides:

Online reputation management monitoring. The idea is that when the string entered in the standing query service appears, the user will be notified. The company says:

We allow you to input your brand, your name, or other data so you make sure your reputation stays up to date.

The service tracks competitors too. The service is easy to use:

Simply enter your competitor’s names and keep track of what they are doing right, or doing wrong!
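The standing query idea can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the subscriber names, and the sample documents are hypothetical, and nothing here reflects how Brandchirps actually works.

```python
def check_standing_queries(standing_queries, new_documents):
    """Return (owner, keyword, doc_id) alerts for each stored keyword found in new documents."""
    alerts = []
    for doc_id, text in new_documents.items():
        lowered = text.lower()
        for owner, keywords in standing_queries.items():
            for keyword in keywords:
                # Case-insensitive substring match; a real service would need
                # tokenization, deduplication, and rate limiting.
                if keyword.lower() in lowered:
                    alerts.append((owner, keyword, doc_id))
    return alerts


# Hypothetical subscribers and freshly crawled documents.
standing_queries = {
    "alice": ["Acme Widgets", "acme recall"],
    "bob": ["Globex"],
}
new_documents = {
    "post-1": "Rumors of an Acme recall are spreading.",
    "post-2": "Nothing relevant here.",
}
alerts = check_standing_queries(standing_queries, new_documents)
```

Each alert would then trigger a notification to the owner of the query. The matching is the easy part; collecting the content to match against is where the cost sits.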

How much does the service cost? Are we talking a letter verifying that you are working for law enforcement or an intelligence agency? A six figure budget? A staff of technologists?

Nope.

The cost of the service (as of January 20, 2020) is:

  • $7 per month for five keywords
  • $16 per month for 20 keywords

Several observations:

  • The cost for this service, which allegedly monitors the Web and social media, is very low. Government organizations strapped for cash are likely to check out this service.
  • The system does not cover the Dark Web and other “interesting” content, but that could be changed by licensing data sets from specialists, assuming legal and financial requirements of the Dark Web content aggregators can be negotiated by Brandchirps.
  • It is not clear at this time if the service monitors metadata on images and videos, podcast titles, descriptions, and metadata, or other high-value content.
  • The world of secret monitoring and alerts has become more accessible, which can inspire innovators to use this tool in novel ways.

Net net: Brandchirps is one more example of a technique once removed from general public access that has lost its mantle of secrecy. Will this type of service force the hand of specialized vendors? Yep.

Stephen E Arnold, January 21, 2020

Microsoft Buys AnyVision: Why?

October 30, 2019

We noted “Why Did Microsoft Fund an Israeli Firm That Surveils West Bank Palestinians?” The write up stated:

Microsoft has invested in a startup that uses facial recognition to surveil Palestinians throughout the West Bank, in spite of the tech giant’s public pledge to avoid using the technology if it encroaches on democratic freedoms. AnyVision, which is headquartered in Israel but has offices in the United States, the United Kingdom and Singapore, sells an “advanced tactical surveillance” software system, Better Tomorrow. It lets customers identify individuals and objects in any live camera feed, such as a security camera or a smartphone, and then track targets as they move between different feeds.

The write up covers the functions of the firm’s technology. The contentious subject of facial recognition is raised.

However, one question was not asked, “Why?” Microsoft took action despite employee push back on certain projects.

The answer is, “Possess a technology that gets Microsoft closer to Amazon’s capabilities in this particular technical niche.”

Microsoft has to beef up in a number of technical spaces. It may have a demanding client and a major project which requires certain capabilities. Marketing is one thing; delivering is another.

Stephen E Arnold, October 30, 2019

Facial Recognition and Image Recognition: Nervous Yet?

November 18, 2018

I read “A New Arms Race: How the U.S. Military Is Spending Millions to Fight Fake Images.” The write up contained an interesting observation from an academic wizard:

“The nightmare situation is a video of Trump saying I’ve launched nuclear weapons against North Korea and before anybody figures out that it’s fake, we’re off to the races with a global nuclear meltdown.” — Hany Farid, a computer science professor at Dartmouth College

Nothing like a shocking statement to generate fear.

But there is a more interesting image recognition observation. “Facebook Patent Uses Your Family Photos For Targeted Advertising” reports that the social media sparkler has an invention that will

attempt to identify the people within your photo to try and guess how many people are in your family, and what your relationships are with them. So for example if it detects that you are a parent in a household with young children, then it might display ads that are more suited for such family units. [US20180332140]

While considering the implications of pinpointing family members and linking the deduced and explicit data, consider that one’s fingerprint can be duplicated. The dupe allows a touch ID to be spoofed. You can get the details in “AI Used To Create Synthetic Fingerprints, Fools Biometric Scanners.”

For a law enforcement and intelligence angle on image recognition, watch for DarkCyber on November 27, 2018. The video will be available on the Beyond Search blog splash page at this link.

Stephen E Arnold, November 18, 2018

Silobreaker Takes Gold and Silver in Online Decathlon

July 4, 2015

Short honk: I have been a fan of the Silobreaker system, which is available for commercial and governmental content processing. I read Network Products Guide “New Products and Service: Winners 10th Annual 2015 IT Awards” recommended solutions league table this morning. Silobreaker was founded by a couple of wizards with military and commercial experience. According to the league table, the Silobreaker content processing and information access system is the top dog for applications centering on Europe, the Middle East and Asia. This means that the system’s multi-lingual capabilities were the best, according to the Network Products Guide’s editors. The company also nailed a silver medal for US focused solutions. You can get more information about Silobreaker at www.silobreaker.com. Sign up. Join the thousands of users who want to work with a winner.

Stephen E Arnold, July 4, 2015

Looking Ahead for Bing Entity Engine

April 28, 2014

We know the Web search engines have been working to reduce the number of clicks between us and our desired information and/or action points. For Google, the mechanism behind this is called Knowledge Graph. For Bing, it’s the Entity Engine. Now, TechCrunch reports that “Microsoft Has Big Plans for Bing’s Entity Engine.”

Bing has always emphasized hooking users up with results that let them take action, like reserving a table or booking a flight. This increasingly means working with third-party sites. Reporter Frederic Lardinois interviewed Derrick Connell, head of the Bing Experience group. Lardinois writes:

“Connell argues that the only way to do this efficiently is to create an open ecosystem that powers these actions. ‘We think a lot about how we can create value for everybody who is participating in this new emerging space,’ he said. ‘And how can we bring the best set of players to the table for our users?’

“Today, this means having partnerships with Yelp, OpenTable, TripAdvisor and others, and Microsoft then highlights the actions they make possible on its search engine. In the long run, though, Connell envisions an open ecosystem where any site can make actions available using a standard markup language (he mentioned schema.org as an option in our conversation). Then, when a user looks for an entity, Bing could map this to an entity provider and shorten the path users take between searching for something and putting this knowledge into action. Ideally, this could even mean taking the action right on Bing (maybe even with a single click), but Connell acknowledged that issues around identity and login management will probably mean users will have to take most actions on a third-party site.”
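To make the markup idea concrete, here is a rough sketch of how a site might describe an entity and an action using schema.org vocabulary, and how a search engine might read it. The restaurant, the URL, and the `actions_for` helper are hypothetical; the sketch assumes JSON-LD with schema.org’s `potentialAction`, `ReserveAction`, and `EntryPoint` terms and is not a description of what Bing actually does.

```python
import json

# Hypothetical entity description a third-party site might publish as JSON-LD.
restaurant = {
    "@context": "https://schema.org",
    "@type": "Restaurant",
    "name": "Example Bistro",
    "potentialAction": {
        "@type": "ReserveAction",
        "target": {
            "@type": "EntryPoint",
            "urlTemplate": "https://example.com/reserve?party={party_size}",
        },
    },
}


def actions_for(entity):
    """Return (action type, URL template) pairs an engine could surface for an entity."""
    actions = entity.get("potentialAction", [])
    if isinstance(actions, dict):  # a single action may be given without a list
        actions = [actions]
    return [(a["@type"], a["target"]["urlTemplate"]) for a in actions]


markup = json.dumps(restaurant, indent=2)  # what the site would embed in its page
available = actions_for(json.loads(markup))
```

The engine’s job is then to map a searched entity to a provider whose markup offers the action; choosing among several providers is the harder, non-technical question.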

Unsurprisingly, Connell argues that Microsoft may be one of the only companies capable of building such a project. For now, as more third-party sites become involved, the problem is how to decide which gets the traffic from any particular search. Lardinois makes an interesting observation: the prevalence of the Microsoft Office suite means we could see the day when Bing lets us search the Web from within Word or Excel. Near-monopoly does have its advantages.

Cynthia Murrell, April 28, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Entity Extraction with Solr

July 11, 2013

Entity extraction is a feature that many enterprise users want to build into their architecture. Solr 4 has the features that allow a work around or “poor man’s” entity extraction. Erik Hatcher, one of the founders of LucidWorks, explains how in his SearchHub blog entry, “Poor Man’s ‘Entity’ Extraction with Solr.”

The instructions begin:

“Entity extraction, as defined on Wikipedia, ‘seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.’ When drilling down into the specifics of the requirements from our customers, it turns out that many of them have straightforward solutions using built-in (Solr 4.x) components, such as: Acronyms as facets; Key words or phrases, from a fixed list, as facets; Lat/long mentions as geospatial points.”
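Hatcher’s post relies on built-in Solr components, which are not reproduced here. As a language-neutral illustration of the same three ideas (acronyms, phrases from a fixed list, and lat/long mentions), here is a small Python sketch; the regular expressions, the phrase list, and the function name are assumptions made for this example only.

```python
import re

# Assumption: an acronym is a standalone run of two or more capital letters.
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")

# Assumption: lat/long mentions look like "36.1627, -86.7816" (decimal degrees).
LAT_LONG = re.compile(r"(-?\d{1,2}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")

# Hypothetical fixed list of key phrases to surface as facet values.
KEY_PHRASES = ["entity extraction", "big data"]


def poor_mans_extract(text):
    """Pull facet-style values out of text with patterns and a fixed list."""
    lowered = text.lower()
    return {
        "acronyms": sorted(set(ACRONYM.findall(text))),
        "key_phrases": [p for p in KEY_PHRASES if p in lowered],
        "points": [(float(lat), float(lon)) for lat, lon in LAT_LONG.findall(text)],
    }


facets = poor_mans_extract(
    "The CIA and NASA discussed entity extraction at 36.1627, -86.7816."
)
```

In Solr the equivalent values would land in facet and geospatial fields at index time; the point is that none of the three requires trained models.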

SearchHub is one of many means through which LucidWorks bolsters its support and training to all Apache Lucene Solr developers as well as LucidWorks customers. LucidWorks users find that both the LucidWorks Big Data and LucidWorks Search solutions are ready to go out-of-the-box but allow customization and scalability in a way that Hatcher demonstrates above.

Emily Rae Aldridge, July 11, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Digital Reasoning Hits High Note with Opera Solutions

October 23, 2012

We just learned that Digital Reasoning’s Synthesys® system now integrates with Opera Solutions’ Signal Hub™ Technology. For those struggling with big data challenges, the center stage role of Digital Reasoning brings new agility to organizations.

Digital Reasoning™, one of the leaders in unstructured data analytics at scale, revealed a partnership with Opera Solutions. Opera Solutions is one of the leading global big data science companies. This integration will expand Opera Solutions’ Signal Hub technology to include unstructured text analytics from Digital Reasoning’s Synthesys, providing a comprehensive Big Data solution for innovative enterprises.

Rob Metcalf, President and COO for Digital Reasoning, told ArnoldIT:

We are excited to make this important announcement at the Strata Conference in the heart of New York’s Financial District. Opera Solutions’ predictive analytics capabilities are a perfect complement with Digital Reasoning’s Synthesys, and together we look forward to unlocking valuable insights for customers in the financial industry and beyond.

According to Opera Solutions, the Digital Reasoning Synthesys technology provides advanced methods of extracting critical insights from email, research, Web content, and other unstructured data sources.

Laks Srinivasan, General Manager, Global Markets for Opera Solutions, told me:

When combined with the machine-learning science in our Signal Hub technologies, we can deliver directed actions to frontline decision makers focused on managing financial risk.

ArnoldIT’s analysis of Signal Hub technologies revealed that the system can extract critical signals from flows of big data. Opera Solutions’ technology is deployed at major organizations spanning industries such as financial services, healthcare, retail, and transportation. These organizations rely on Signal Hubs to optimize their supply chain, gain unique marketing insights, and stay ahead of financial risk with knowledge from both internal systems and external, unstructured sources.

Readers of Beyond Search know that Synthesys was developed in close conjunction with critical Big Data analytics challenges over the past decade. In 2012, Digital Reasoning gained momentum in financial services, healthcare, and legal markets, where automated understanding of unstructured data is a necessity.

Synthesys is a platform for making sense of unstructured data. Modeled after the human understanding process, Synthesys reads, resolves and reasons across hundreds of millions of documents to automatically understand and isolate critical information such as risks, opportunities and anomalies. Having solved problems for US intelligence agencies for the past decade, Synthesys is now delivering Automated Understanding for big data challenges in finance, healthcare and legal markets. Digital Reasoning is based outside of Nashville, Tennessee, with offices in Washington, D.C., and New York. For more information, visit http://www.digitalreasoning.com/.

Stephen E Arnold, October 23, 2012

Protected: Exclusive Interview: David B. Camarata, IKANOW

April 9, 2012

This content is password protected.

The Summer of Big Deals

September 1, 2011

Will These Blockbusters Affect Business Intelligence?

The summer has been a hot one, not in terms of temperature, but when measured on the acquisition thermometer. First, Oracle, the sprawling database and enterprise applications company, bought InQuira. Then, Google took one third of its cash and the equivalent of two years’ profit and bought Motorola Mobility. And Hewlett Packard, one of the icons of the Silicon Valley way, spent $10 billion on its surprise purchase of Autonomy plc.

Business intelligence, intellectual property, and information management turned up the heat for investors and those tracking active market sectors. The market interest is high and many think these deals are likely to sustain their energy. But I don’t see it that way. I think the deals are more like dumping charcoal starter on charcoal briquettes: Very dramatic at ignition but certain to cool and fade into the fabric of day-to-day activity.

Starting a charcoal fire can produce some initial pyrotechnics. These fade quickly.

As the founder of Digital Reasoning, a company focused on delivering the next-generation solution based on entity-oriented analytics, I see these deals from the perspective of working with customers to solve big data analytics challenges. First, let me give you my view of information management and traditional business analytics and then outline where I think the technology and the market are going.

Business intelligence in general and analytics in particular are now verbal noise. I know that most of the professionals with whom I speak interpret the phrase “business intelligence” in terms of their own experiences in getting information to make a decision. For some, business intelligence is a report and follow up telephone conversation with a human expert. Don’t get me wrong, consultants and advisors often do great work, but my point is that the phrase “business intelligence” is anchored in a method of information analysis rooted in human behavior unchanged since our ancestors sat around the camp fire roasting meat on sticks.

The word analytics is equally difficult to explain. For many of our clients, analytics means SAS or SPSS (both the bread and butter of traditional statistics courses and business analysts from banking to warehouse management).

Read more
