Need to Tame the Information Tsunamis in Databases? DbSurfer May Be Your Deviled Egg
June 2, 2021
An interesting article “DbSurfer: A Search and Navigation Tool for Relational Databases” describes a novel way to locate information in Codd databases. Nope, I won’t make a reference to codfish. The surfing metaphor is good enough today.
The write up states:
We present a new application for keyword search within relational databases, which uses a novel algorithm to solve the join discovery problem by finding Memex-like trails through the graph of foreign key dependencies. It differs from previous efforts in the algorithms used, in the presentation mechanism and in the use of primary-key only database queries at query-time to maintain a fast response for users.
The Memex reference is not to the mostly forgotten Australian search and retrieval system. The Memex in this paper is a nod to everyone’s information hero Vannevar Bush’s fanciful “memex device.” (No, Google is not a memex device.)
The method involves “joins” and “trails.” The result is a system that allows keyword search and navigation through relational databases.
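The trail idea is easy to picture as a shortest-path search over the graph of foreign key dependencies. Here is a minimal sketch; the table names and the breadth-first search are my own illustration of the join discovery problem, not the paper's actual algorithm:

```python
from collections import deque

def join_trail(fk_graph, start_table, goal_table):
    """Breadth-first search for a join path (a 'trail') through the
    foreign-key dependency graph. fk_graph maps each table to the
    tables it is linked to by a foreign key, in either direction."""
    queue = deque([[start_table]])
    seen = {start_table}
    while queue:
        path = queue.popleft()
        if path[-1] == goal_table:
            return path
        for neighbor in fk_graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no join path connects the two tables

# Hypothetical schema: orders reference both customers and products.
fk_graph = {
    "customers": ["orders"],
    "orders": ["customers", "products"],
    "products": ["orders"],
}
print(join_trail(fk_graph, "customers", "products"))
# ['customers', 'orders', 'products']
```

A real system would, as the abstract notes, restrict itself to primary-key lookups at query time to keep the response fast; the graph walk above only finds which tables to join.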
The paper includes a useful list of references. (Some recent computer science graduates who are billing themselves as search experts might find reading a few of the citations helpful. Just a friendly suggestion to the AI, NLP, and semantic whiz types.)
Is this a product? Nope, not yet. Interesting idea, however.
Stephen E Arnold, June 2, 2021
Endeca: In the News Again. Remarkable
May 31, 2021
Endeca is the outfit which was among the first of the search vendors pushing the concept of “facets” and “guided navigation.” The technology dates from 1999. The company was interesting because it used some fancy marketing concepts to paper over the manual effort required to get the system to group content and display classifications; for example, provide an Endeca system with articles about Beaujolais, and the system would put the content in the “wine” category. Believe me, people loved the idea that the system could index words and concepts. And the human part? Yeah, after signing the deal, some customers gained a new appreciation for the human work required and the computational load the system imposed on computing resources. Like most of the search systems of that era, the company ended up selling itself to Oracle. Oracle had an appetite for search technology; for example, Applied Linguistics, Triple Hop, and RightNow (also acquired in 2011 when search was “hot”), among others.
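For what it is worth, the facet display itself is the easy part. Once humans (or “fancy marketing”) have tagged the content, a facet is just a count of documents per category value shown next to the result list. A toy sketch with made-up documents:

```python
from collections import Counter

# Toy documents already tagged with categories -- the step that
# required the human effort described above.
docs = [
    {"title": "Beaujolais nouveau arrives", "category": "wine"},
    {"title": "Bordeaux futures market", "category": "wine"},
    {"title": "Single malt tasting notes", "category": "whisky"},
]

# A facet is a count of documents per category value, displayed
# alongside the result list for "guided navigation."
facets = Counter(doc["category"] for doc in docs)
print(facets.most_common())  # [('wine', 2), ('whisky', 1)]
```

The expensive part, and the part the marketing glossed over, is getting the `category` values assigned correctly in the first place.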
Now Oracle Endeca is back in the news. Frankly I was surprised to read “Oracle Boasted That Its Software Was Used against US Protesters. Then It Took the Tech to China.” My first question was, “When did this alleged taking of tech to China occur?” The answer: right after Oracle bought Endeca in 2011. Why was Endeca for sale? Not germane to the write up. I think the answer is that Endeca hit a revenue glass ceiling. The Endeca method (disclosed in part in US patent 7,035,864, filed in 2000) required some technical cartwheels, apart from the MBA consulting work, to crank out useful facets.
The computational hoo-hah is one reason Endeca chased and caught some cash from Intel. The idea was to use Intel’s whiz bang multi core chips to increase the content processing speed. New MBAs and subject matter experts were available, but the bet was on chips and Intel super tech. Wowza!
Wrong.
The issue with Endeca’s method is suggested in this statement from the article I was surprised to read:
At the peak of the NATO protests, police reportedly used Endeca to process 20,000 tweets an hour.
Okay, 20,000. How many tweets were flying around in 2011? According to a Twitter blog post, in 2013 the volume of tweets was 500 million per day, which works out to about 5,700 per second. Knock these numbers down by 20 percent, and you still get a tweet flow of roughly 270,000 per minute against Endeca’s reported 20,000 per hour. Throughput? Yeah, let’s talk about how much actionable information can be derived for a real time event when the processing will have a tough time catching up with the protest.
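The arithmetic is easy to check (figures from the post; my rounding differs slightly from the 5,700-per-second shorthand):

```python
# Back-of-the-envelope throughput check using the figures in the post.
tweets_per_day = 500_000_000            # Twitter's 2013 figure
per_second = tweets_per_day / 86_400    # 86,400 seconds in a day
per_minute = per_second * 60
discounted = per_minute * 0.8           # knock 20 percent off for 2011

endeca_per_minute = 20_000 / 60         # "20,000 tweets an hour"

print(round(per_second))         # 5787
print(round(discounted))         # 277778
print(round(endeca_per_minute))  # 333
```

Roughly 278,000 tweets per minute arriving versus about 333 per minute processed: that is the gulf.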
Rules based, ageing technology which is computationally intensive and pivots on human data massaging is not going to do the job for enterprise search, policeware, or intelware applications. A small ecommerce site selling wine? Perfect. The Twitter fire hose or a more challenging task like E2EE messaging? Highly unlikely.
There were more promising solutions, and what’s interesting is that Oracle invested in one of them. You will have to do some work to discern the connection between Oracle’s Irish investment operation and a company allegedly headquartered in Manchester Square in London, but the links are there. That’s a more interesting China-Oracle connection, and it is one more relevant to monitoring the actions of companies than the Endeca deal.
By the way, on Oracle’s watch Endeca became sort of a market intelligence and ecommerce offering, not a stellar tool for the often questioned In-Q-Tel operation.
The write up ends with this quote attributed to a wizard:
“It still boggles my mind.”
What boggles my mind is that Endeca is not a particularly timely product. Even more baffling is how the write up missed other, more significant Oracle China connections. Maybe a “real” journalist will visit Manchester Square and check out what companies do business from that location. One of them might be Oracle.
Why did Oracle pitch the Endeca tech to China? The company was trying to generate a sustainable, high dollar return from this horse in the Oracle search and content processing corral. Like RightNow, some of those horses do not look like potential Kentucky Derby winners.
Stephen E Arnold, May 31, 2021
Another Way to Inject Ads into Semi-Relevant Content?
May 25, 2021
It looks like better search is just around the corner. Again. MIT Technology Review proclaims, “Language Models Like GPT-3 Could Herald a New Type of Search Engine.” Google’s PageRank has reigned over online search for over two decades. Even today’s AI search tech works as a complement to that system, used to rank results or better interpret queries. Now Googley researchers suggest a way to replace the ranking system altogether with an AI language model. This new technology would serve up direct answers to user queries instead of supplying a list of sources. Writer Will Douglas Heaven explains:
“The problem is that even the best search engines today still respond with a list of documents that include the information asked for, not with the information itself. Search engines are also not good at responding to queries that require answers drawn from multiple sources. It’s as if you asked your doctor for advice and received a list of articles to read instead of a straight answer. Metzler and his colleagues are interested in a search engine that behaves like a human expert. It should produce answers in natural language, synthesized from more than one document, and back up its answers with references to supporting evidence, as Wikipedia articles aim to do. Large language models get us part of the way there. Trained on most of the web and hundreds of books, GPT-3 draws information from multiple sources to answer questions in natural language. The problem is that it does not keep track of those sources and cannot provide evidence for its answers. There’s no way to tell if GPT-3 is parroting trustworthy information or disinformation—or simply spewing nonsense of its own making.”
The next step, then, is to train the AI to keep track of its sources when it formulates answers. We are told no models are yet able to do this, but it should be possible to develop that capability. The researchers also note the thorny problem of AI bias will have to be addressed for this approach to be viable. Furthermore, as search expert Ziqi Zhang at the University of Sheffield points out, technical and specialist topics often stump language models because there is far less relevant text on which to train them. His example—there is much more data online about e-commerce than quantum mechanics.
Then there are the physical limitations. Natural-language researcher Hanna Hajishirzi at the University of Washington warns the shift to such large language models would gobble up vast amounts of memory and computational resources. For this reason, she believes a language model will not be able to supplant indexing. Which researchers are correct? We will find out eventually. That is okay; we are used to getting ever less relevant search results.
Cynthia Murrell, May 25, 2021
Marketers Assert AI Perfect for eDiscovery
May 24, 2021
Automated eDiscovery firm ZyLab makes a case for AI in the law firm with its post, “A Chief Legal Officer’s Guide to AI-Based eDiscovery and Analytics,” shared at JDSupra. Writer Jeffrey Wolff begins by outlining the job of a CLO. He notes lawyers in that position tend to be most comfortable with the “traditional” duties of risk mitigation, monitoring legal matters, and minding laws and regulations. According to a Deloitte study, however, executives would like to see their CLOs work more on guiding the company culture and squaring legal concerns with company goals. Wolff suggests outsourcing this part of the CLO role. (We observe his company happens to offer such expert professional services.)
After that pitch, we learn why CLOs should consider AI. We’re told:
“AI excels at sifting through massive quantities of data to identify specific terms or concepts, even when those concepts are expressed in different terms. Because an AI system can scan data faster than any human and doesn’t get tired or distracted, it can evaluate data sets faster and more easily than a human while maintaining accuracy. A machine can also manage repetitive, laborious tasks quickly and effectively without falling prey to boredom or wandering attention. Legal departments can therefore use AI to streamline processes, reduce costs, and increase their productivity. Given that ‘nearly two-thirds (63 percent) of [legal department] respondents say recurring tasks and data management constraints prevent their legal teams from creating value at their organization,’ AI offers a way for CLOs to offload those time-consuming responsibilities and focus on the strategy and growth priorities that matter to the company’s future.”
A good place to start is with ZyLab’s specialty, eDiscovery. That area does involve a mind-boggling amount of data and AI can be quite valuable, even indispensable for larger firms. Wolff describes six ways AI tools can help with corporate eDiscovery: completing early case assessment, structuring data through concept clustering, using Technology-Assisted Review, redacting personal information, generating eDiscovery analytics, and managing eDiscovery costs. See the write-up for more on each of these tasks.
The company’s technology dates from 1983 (38 years ago). Today’s ZyLab supplies eDiscovery and Information Governance tech to large corporations, government organizations, regulatory agencies, and law firms around the world. The company launched with its release of the first full-text retrieval software for the PC. Its eDiscovery/Information Management platform was introduced in 2010. ZyLab is based in Amsterdam and has embraced the lingo of smart software like other eDiscovery firms.
Cynthia Murrell, May 24, 2021
Web Search: In Flux
May 17, 2021
I listened to an interview conducted by the host of the Big Technology podcast and Sridhar Ramaswamy, the former Xoogler who was in charge of Google Advertising for a number of years. Mr. Ramaswamy’s new venture is a subscription Web search engine. The interview was interesting, but I somehow missed the definition of what will be the “Web” content the system would index. I brought up this “missing question” at lunch today because the “Web” can mean different things to different searchers. Does the system search dynamic sites like those built on Shopify? Does it index forums and public discussion groups? Does it index password protected but no cost sites like Nextdoor.com? You get the idea without my tossing in videos, audio, and tabular data on government Web sites.
What the interview did not touch upon was the Infinity search system. You can get information about this $5.00 US per month service at this link. The system seems to be a combination of metasearch and proprietary indexing. Our tests, prior to its becoming a subscription service, were mixed. Overall, the results were not as useful as those retrieved from Swisscows.com, for example. The value proposition of the Xoogler’s subscription search service and Infinity seemed similar.
I want to mention that Yippy, the Web search component of Vivisimo, seems to have gone offline. I thought the Vivisimo service was interesting even though the company focused on selling itself to IBM and becoming a cog in the IBM Big Data Watson world. The on-the-fly clustering was as good as, if not better than, the original version of Northern Light clustering. As I listened to the explanation of why the time is right for subscription search of the Web (whatever that means), I wondered why Yippy did not push aggressively for subscription revenues. Perhaps subscription services make sense when plugging assumptions into an Excel model? In real life, subscriptions are difficult.
The reality of Web (whatever that means) search is that costs go up. The brutal fact is that once content is indexed, that content must be revisited and changes discerned. Indexing changed content keeps the information in the index for those sites fresh. Also, the flows of new content mean that wonky new sites like those tallied by Product Hunt have to be identified, indexed, and then passed to the update queue. The users are often indifferent to indexing update cycles. Web search engines have to allocate their resources among a number of different demands; for example, which sites get updated in near real time? Which sites get indexed every six months like the US government Railroad Retirement Board site? Which sites get a look every couple of months?
And what about the rich media? The discussion groups? The Web sites which change their method of presenting content so that a crawler just skips the site? How deep does the crawler go? What happens to images? What about sites which require users to do something to get access; for example, a user name, a password, and then authentication on a smartphone?
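The resource allocation question can be pictured as a priority queue keyed on each site's revisit interval. A toy sketch with hypothetical sites and intervals of my own choosing:

```python
import heapq
from datetime import datetime, timedelta

# Minimal recrawl scheduler: a heap keyed on next-due time decides which
# site the crawler revisits next; each site has its own revisit interval.
intervals = {
    "news.example.com": timedelta(hours=1),        # near real time
    "rrb.gov": timedelta(days=180),                # every six months
    "hobby-blog.example.org": timedelta(days=60),  # every couple of months
}
now = datetime(2021, 5, 17)
schedule = [(now + interval, site) for site, interval in intervals.items()]
heapq.heapify(schedule)

# The soonest-due site comes off first, then is requeued for its next visit.
due, site = heapq.heappop(schedule)
heapq.heappush(schedule, (due + intervals[site], site))
print(site)  # news.example.com -- fast-changing sites dominate the queue
```

Multiply this by billions of sites and add the discovery of new ones, and the cost curve the paragraph describes comes into focus.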
Net net: The world of Web search is in flux. It is more difficult than at any time in my professional life to locate specific information. Maybe subscription services will do the trick? My hunch is that the lessons of the DataStars and Dialcoms and Lycoses will be helpful to today’s innovators.
What? You don’t remember DataStar? That’s one of the issues experts in search and retrieval face: learning from yesterday’s innovators.
Stephen E Arnold, May 17, 2021
More Search Explaining: Will It Help an Employee Locate an Errant PowerPoint?
May 13, 2021
“Semantics, Ambiguity, and the Role of Probability in NLU” is a search-and-retrieval explainer. After half a century of search explaining, one would think that the technology required to enter a keyword and get a list of documents in which the keyword appears would be nailed down. Wrong.
“Search” in 2021 embraces many sub disciplines. These range from explicit index terms like the date of a document to more elusive tags like “sentiment” and “aboutness.” Boolean has been kicked to the curb. Users want to talk to search, at least to Alexa and smartphones. Users want smart software to deliver results without the user having to enter a query. When I worked at Booz, Allen & Hamilton, one of my colleagues (I think his name was Harvey Poppel, the smart person who coined the phrase “paperless office”) suggested that someday a smart system would know when a manager walked into his or her office. The smart software would display what the person needed to know for that day. The idea, I think, was that whilst drinking herbal tea, the smart person would read the smart outputs and be smarter when meeting with a client. That was in the late 1970s, and where are we? On Zooms and looking at smartphones. Search is an exercise in frustration, and I think that is why venture firms continue to pour money into ideas, methods, concepts, and demos which have been recycled many times.
I once reproduced a chunk of Autonomy’s marketing collateral in a slide in one of my presentations. I asked those in the audience to guess at what company wrote the text snippet. There were many suggestions, but none was Autonomy. I doubt that today’s search experts are familiar with the lingo of search vendors like Endeca, Verity, InQuire, et al. That’s too bad because the prose used to describe those systems could be recycled with little or no editing for today’s search system prospects.
The write up in question is serious. The author penned the report late last year, but Medium emailed me a link to it a day ago along with a “begging for dollars” plea. Ah, modern online blogs. Works of art indeed.
The article covers these topics as part of the “search” explainer:
- Ambiguity
- Understanding
- Probability
Ambiguity is interesting. One example is a search for the word “terminal.” Does the person submitting the query want information about a computer terminal, a bus terminal, or some other type of terminal; for instance, the post terminal on the transformer to my model train set circa 1951? Smart software struggles with this type of ambiguity. I want to point out that a subject matter expert can assign a “field code” to the term and eliminate the ambiguity, but SMEs are expensive and they lose their index precision capability as the work day progresses.
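The field code fix looks like this in miniature; the index entries and field names below are invented for illustration:

```python
# Toy index entries where an editor has assigned a field code to
# resolve the ambiguity of the bare word "terminal".
index = [
    {"term": "terminal", "field": "computing", "doc": "VT100 emulator guide"},
    {"term": "terminal", "field": "transport", "doc": "Bus terminal schedules"},
    {"term": "terminal", "field": "electronics", "doc": "Transformer post terminals"},
]

def fielded_search(index, term, field):
    """Return documents matching the term only within one field code."""
    return [e["doc"] for e in index if e["term"] == term and e["field"] == field]

print(fielded_search(index, "terminal", "transport"))
# ['Bus terminal schedules']
```

Precision is perfect as long as a human assigned the right field code, which is exactly the expensive part.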
To deal with the “terminal” example, the modern system has to understand [a] what the user wants and [b] what the content objects are about. Yep, aboutness. Today’s smart software does an okay job with technical text because jargon like Octanitrocubane allows relatively on point identification of a document relevant to a chemist in Columbus, Ohio. Toss in a chemical structure diagram, and the precision of the aboutness ticks up a notch. However, if you search for a word replete with social justice meaning, smart software often has a difficult time figuring out the aboutness. One example is a reference to Skokie, Illinois. Is that a radical right wing code word or a town loved for its Potawatomi linguistic heritage?
Probability is a bit more specific — usually. The idea in search is that numbers can illuminate some of the dark corners of text’s meaning. Examples are plentiful. Curious about Miley Cyrus on SNL and then at the after party? The search engine will display the most probable content based on whatever data is sluiced through the query matcher and stored in a cache. If others looked at specific articles, then, by golly, a query about Miley is likely or highly probable to be just what the searcher wanted. The difference between ambiguity, understanding, and probability is — in my opinion — part of the problem search vendors face. No one can explain why, after 50 years of SMART, Personal Library Software, STAIRS, et al., finding on point information remains frustrating, expensive, and ineffective.
The write up states:
ambiguity was not invented to create uncertainty — it was invented as a genius compression technique for effective communication. And it works like magic, because on the receiving end of the message, there is a genius decoding and decompression technique/algorithm to uncover all that was not said to get at the intended thought behind the message. Now we know very well how we compress our thoughts into a message using a genius encoding scheme, let us now concentrate on finding that genius decoding scheme — a task that we all call now ‘natural language understanding’.
Sounds great. Now try this test. You have a recollection of viewing a PowerPoint a couple of weeks ago at an offsite. You know who the speaker was and you want the slide with the number of instant messages sent per day on WhatsApp? How do you find that data?
[a] Run a query on your Fabasoft, SearchUnify, or Yext system?
[b] Run a query on Google in the hopes that the GOOG will point you to Statista, a company you believe will have the data?
[c] Send an email to the speaker?
[d] All of the above.
I would just send the speaker a text message and hope for an answer. If today’s search systems were smart, wouldn’t the single PowerPoint slide be in my email anyway? Sure, someday.
Stephen E Arnold, May 13, 2021
Be Cool with Boole
May 10, 2021
How often have you turned to a search engine to answer a question? You know the answer is on the tip of your tongue, but you cannot remember anything about it. Take that back, you do remember things about the answer; that is, you know what it is not. For example, you are trying to remember the name of 1980s transforming robots, but they are not Hasbro Transformers. Normally you could use the Boolean operator NOT in your query, but that does not yield useful results.
Thankfully Tech Xplore explains that negative search options are on their way in the article “New Approach Enables Search Engines To Describe Objects With Negative Statements.” Search engines and other computer programs use knowledge bases to answer user questions. The information must be structured in order for it to be discovered. Most information in knowledge bases consists of positive statements, or statements that describe something true. Negative statements are left out even though they contain valuable information. They are not used because there is an infinite number of possible negative statements, making it impossible to structure every one.
Simon Razniewski of the Saarbrücken Max Planck Institute for Informatics and his research team created a method to generate negative statements for knowledge bases in different applications. It works as follows:
“Using Steven Hawking as an example, the novel approach works as follows: First, several reference cases are identified that share a prominent property with the search object. In the example: physicists. The researchers call these comparison cases “peers.” Now, based on the “peers,” a selection of positive assumptions about the initial entity is generated. Since the physicists Albert Einstein and Richard Feynman won the Nobel Prize, the assumption Steven Hawking won the Nobel Prize could be made. Then, the new assumptions are matched with existing information in the knowledge base about the initial entity. If a statement applies to a “peer” but not to the search object, the researchers conclude that it is a negative statement for the search object—i.e., Steven Hawking never won the Nobel Prize. To evaluate the significance of the negative statements generated, they are sorted using various parameters, for example, how often they occurred in the peer group.”
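The quoted approach can be mocked up in a few lines. This is my own toy rendering of the peer comparison idea, not the researchers' code, and the knowledge base facts are illustrative only:

```python
def negative_statements(kb, entity, peers):
    """Infer salient negative statements for an entity: collect the
    positive statements of its peers, keep those absent from the
    entity's own statements, and rank by peer frequency."""
    candidates = {}
    for peer in peers:
        for statement in kb.get(peer, set()):
            candidates[statement] = candidates.get(statement, 0) + 1
    own = kb.get(entity, set())
    negatives = {s: n for s, n in candidates.items() if s not in own}
    return sorted(negatives, key=negatives.get, reverse=True)

# Toy knowledge base of positive statements.
kb = {
    "Einstein": {"won Nobel Prize", "worked as physicist"},
    "Feynman": {"won Nobel Prize", "worked as physicist"},
    "Hawking": {"worked as physicist"},
}
print(negative_statements(kb, "Hawking", ["Einstein", "Feynman"]))
# ['won Nobel Prize'] -- i.e., Hawking never won the Nobel Prize
```

The ranking step stands in for the "various parameters" the quote mentions; a real system would weight peer frequency against other signals.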
The research team sees applications in recommender systems like those in search engines or on commerce Web sites. They hope to refine the system to identify nuanced negative statements and implicit negative statements. Using negative statements will make search engines more intuitive, and the research crosses over into the realms of NLP and AI. Explicit Boolean operators could become obsolete.
Boolean may be back!
Whitney Grace, May 10, 2021
Online: Finding Info Is Easy or Another Dark Pattern?
May 7, 2021
When I attended meetings about online search, I found considerable amusement in comments like “Online makes finding information easy” and “I am an expert at finding information on Google.” Hoots for sure.
I read “How to Find a Buyer or Seller’s Facebook Profile on Marketplace.” According to the write up, at some time in the recent past “finding” information about a person offering something for sale on Facebook Marketplace was easy. Since I have never used Facebook Marketplace, I can accept the facile use of the word “easy” as something a normal thumbtyping Facebooker could do. Some investigators probably had the knowledge required to figure out who was pitching a product allegedly stolen from a bitcoin billionaire.
The write up identifies about nine steps in the process to navigate from a listing’s “seller handle” to the vendor’s Facebook profile. I thought this online search was easy.
I can think of several reasons why Facebook makes finding information difficult with weird words and wonky icons. (One of these was described as a “carrot” in the write up. A carrot? What’s up, Mark?)
It is possible that Facebook wants to accrue clicks and stickiness. Since I don’t use Facebook, I am not a good judge of how sticky the site is. I do know that some individuals in government agencies think a lot about Facebook and the information the company’s databases contain.
Another possibility is that Facebook wants to make it more difficult for stalkers, miscreants, and investigators to move from a product listing to the seller information. The happy face side of me says, “Facebook cares about its users.” The frowny face says, “Facebook wants to make life difficult for anyone to get useful information because accountability is a bad thing.”
A third possibility is that Facebook’s engineers are just incompetent.
Net net: Finding information online is easy as long as one works at the organization with the data and the person doing the looking has root. Others get an opportunity to explore a Dark Pattern. Fun. Helpful even.
Stephen E Arnold, May 7, 2021
Searcher Beware or Turpiculo Puella Naso, Take What You Get
May 6, 2021
More Google ads, more questions like this one: How many would knowingly pay to have an algorithm dial a number for them? Apparently, searchers are being tricked into doing just that, we learn from this article posted at the Which? Press Office: “Misleading Customer Service Ads on Google are Costing Consumers, Which? Reveals.”
Researchers at Which?, a consumer advocacy organization in the UK, studied the results that popped up when they searched for car insurer’s phone numbers. They found both high-rate call connection services and claims-management companies often appeared at the top of the list, before the insurers’ own sites. The write-up tells us:
“Which? found one in five searches (21%) displayed adverts for ‘call connecting’ services at the top of the results. These adverts appear above the insurer’s number and when consumers tap on an advert, they’ll be taken to a website which displays a large phone number and a button that says ‘click to call’. Consumers will be put through to their insurer, but via a premium-rate phone number. The cost of making these calls can quickly escalate – with a 30-minute phone call costing £112.50 on Sky, £124.50 on Three and £127.50 on Vodafone.”
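The per minute rates implied by those 30 minute figures are easy to work out:

```python
# Per-minute premium rates implied by the 30-minute call costs quoted above.
costs_30_min = {"Sky": 112.50, "Three": 124.50, "Vodafone": 127.50}
per_minute = {carrier: round(cost / 30, 2) for carrier, cost in costs_30_min.items()}
print(per_minute)  # {'Sky': 3.75, 'Three': 4.15, 'Vodafone': 4.25}
```

Roughly £4 a minute to avoid dialing a number oneself.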
As of this writing, £1 equals $1.39 US. That is a lot to skip the bother of dialing (or copy-and-pasting) for oneself. Such ads officially violate Google’s rules, and the company swears it removes them. And yet, there they were. Then there are the claims management companies. We learn:
“The investigation also found ‘click to dial’ ads for claims management companies (CMCs) were rife and appeared in two in five searches (43%) for customer service phone numbers. ‘Click to dial’ ads have a clickable number in the search result itself. Some of these ads can trick customers to believe they’re contacting their insurer, when they’re actually being put through to a third-party to handle their claim, who will take a cut from any insurance payout. These charges often aren’t stated upfront on the CMCs websites and can catch consumers unaware.”
Insurers have been complaining to Google about these ads for years, but can do little about them but warn their customers. Only if the CMC performs certain deceptions, like using an insurer’s logo, can the company petition to have the ads removed. Less infringing tricks, like using the word “official,” are just fine by Google. To get their own ads to appear at the top, insurers must pay more and more protection (aka advertising) money to Google. Again, Google swears it does not allow misleading advertising. Which? is trying convince the search giant to do more to stop these ads, but they are battling uphill against the power of ad revenue. Meanwhile, users are reminded to check for the little word “Ad” in the top corner of search results and to check that results match the term they entered and state the name of the company they are trying to connect with. As long as Google refuses to protect its users, caution is required.
Cynthia Murrell, May 6, 2021
Reddit Search Engines: Some Tweaks Might Be Useful
May 6, 2021
Reddit is a popular and vast social media network. It is also a big disorganized mess. The likelihood of finding a thread you read on the main page three weeks ago is slim to none, unless you happened to comment on it. That, however, requires a Reddit account, and not everyone has one. Google and other search engines attempt to locate information on Reddit. Reddit attempts to do the same for itself. Both options have limited results.
Reddit search is a can of worms, much like the Web site itself. Information can be found, but it requires a lot of digging. A specialized search algorithm designed to handle the information dump that is Reddit would be the best option. Github hosts a Reddit Search application that does a fair job of locating information, although it has some drawbacks. The search filters are perfect for Reddit, focusing on the author, subreddit, score, dates, search terms, and searching through posts or comments. The more one knows about the post or comment one wishes to locate, the better the application performs. However, searching for basic information on a topic without filling in the subreddit, date span, or author delimiters spits back hundreds of results. Reddit Search is similar to how most out-of-the-box search tools function: they work, but need a lot of fine tuning before they are actually useful. Reddit Search does work as long as you have specific information to fill in the search boxes. Otherwise, it returns only semi-useful results. The good news is that old Reddit is still available. Hunting remains the name of the game for some online information retrieval tasks.
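The delimiters described above amount to a filter chain: the more of them supplied, the shorter the result list. A toy sketch with made-up posts (the record fields mirror the filters the application offers, not Reddit's actual data model):

```python
from datetime import date

# Toy post records standing in for Reddit content.
posts = [
    {"author": "alice", "subreddit": "wine", "score": 42,
     "date": date(2021, 4, 20), "text": "Beaujolais recommendations?"},
    {"author": "bob", "subreddit": "askscience", "score": 7,
     "date": date(2021, 5, 1), "text": "Quantum mechanics question"},
]

def search_posts(posts, term, subreddit=None, author=None,
                 min_score=0, after=None):
    """Filter posts; every delimiter supplied narrows the results."""
    hits = []
    for p in posts:
        if term.lower() not in p["text"].lower():
            continue
        if subreddit and p["subreddit"] != subreddit:
            continue
        if author and p["author"] != author:
            continue
        if p["score"] < min_score or (after and p["date"] < after):
            continue
        hits.append(p)
    return hits

print(len(search_posts(posts, "beaujolais", subreddit="wine")))  # 1
```

With only a bare term and no delimiters, every post containing the word comes back, which is the hundreds-of-results problem in miniature.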
Whitney Grace, May 6, 2021