Kagi Hitches Up with Wolfram
March 6, 2024
This essay is the work of a dumb dinobaby. No smart software required.
“Kagi + Wolfram” reports that the for-fee Web search engine with AI has hooked up with one of the pre-eminent mathy people innovating today. The write up includes PR about the upsides of Kagi search and Wolfram’s computational services. The article states:
…we have partnered with Wolfram|Alpha, a well-respected computational knowledge engine. By integrating Wolfram Alpha’s extensive knowledge base and robust algorithms into Kagi’s search platform, we aim to deliver more precise, reliable, and comprehensive search results to our users. This partnership represents a significant step forward in our goal to provide a search engine that users can trust to find the dependable information they need quickly and easily. In addition, we are very pleased to welcome Stephen Wolfram to Kagi’s board of advisors.
The basic wagon gets a rethink with other animals given a chance to make progress. Thanks, MSFT Copilot. Good enough, but in truth I gave up trying to get a similar image with the dog replaced by a mathematician and the pig replaced with a perky entrepreneur.
The integration of mathiness with smart search is a step forward, certainly more impressive than other firms’ recycling of Web content into bubble gum cards presenting answers. Kagi is taking steps — small, methodical ones — toward what I have described as “search enabled applications” and what my friend Dr. Greg Grefenstette described in his book with the snappy title “Search-Based Applications: At the Confluence of Search and Database Technologies (Synthesis Lectures on Information Concepts, Retrieval, and Services, 17).”
It may seem like a big step from putting mathiness in a Web search engine to creating a platform for search enabled applications. It may be, but I like to think that some bright young sprouts will figure out that linking a mostly brain dead legacy app with a Kagi-Wolfram service might be useful in a number of disciplines. Even some super confident really brilliantly wonderful Googlers might find the service useful.
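To make the “search enabled application” idea concrete, here is a minimal sketch of how a legacy app might route computation-flavored queries to a Wolfram|Alpha style answer endpoint and send everything else to its existing keyword index. The Short Answers endpoint shown is Wolfram|Alpha’s public API, but the appid placeholder, the routing heuristic, and the keyword stub are my illustrative assumptions, not Kagi’s implementation.

```python
import re
import urllib.parse
import urllib.request

# Illustrative only: the appid is a placeholder, and the routing rule is a toy.
WOLFRAM_APPID = "YOUR-APPID"
SHORT_ANSWERS = "https://api.wolframalpha.com/v1/result"  # Wolfram|Alpha Short Answers API

def looks_computational(query: str) -> bool:
    """Crude router: digits, math operators, or math words suggest a
    computational question; anything else goes to plain search."""
    return bool(re.search(r"[\d+\-*/^=]|sqrt|integrate|convert|solve", query, re.I))

def ask_wolfram(query: str) -> str:
    """Send the query to the Short Answers API; the response is plain text."""
    url = f"{SHORT_ANSWERS}?appid={WOLFRAM_APPID}&i={urllib.parse.quote(query)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")

def run_keyword_search(query: str) -> str:
    """Stand-in for the legacy application's existing search path."""
    return f"(keyword results for: {query})"

def answer(query: str) -> str:
    return ask_wolfram(query) if looks_computational(query) else run_keyword_search(query)
```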
Net net: I am gratified that Kagi’s for-fee Web search is evolving. Google’s apparent ineptitude might give Kagi the chance Neeva never had.
Stephen E Arnold, March 6, 2024
Downloading Web Sites: Some Useful Information Available
February 20, 2020
Do you want to download a Web site or the content available from a specific url? What seems easy can become a tricky problem. For example, Google offers “feature” content which is more difficult to download than our DarkCyber video news program. Presumably flowing acrylic paint has more value than information about policeware software.
There are tools available; for example, Cyotek WebCopy and HTTrack, among others. But many of the available Web site downloaders encounter problems with modern Web sites accessible via any “regular” browser. The challenges come from the general Wild West in which Internet accessible content resides.
One site-ripping tool goes an extra step. Whether you download the free version or pay for Microsys’ A1 Website Download, the developers have created a quite useful series of help pages. Many of the problems one can encounter trying to suck down text, images, videos, or other content are addressed.
Navigate to the Microsys help pages and browse the list of topics available. Note that the help refers to A1 Website Download, but the information is likely to be useful if you are using other software or if you are trying to code your own site ripper.
The topics addressed by Microsys include:
- Some basics like how to restrict how many pages are grabbed
- Frequent problems encountered; for example, no urls located
- The types of “options” available; for instance, dealing with Unicode. These “options” provide a useful checklist of important functions to include if you are rolling your own downloader. If you are trying to decide on an alternative to A1 Website Download, the list is useful.
- A rundown of frequently encountered errors and response codes; for example, hard and soft 404s (see the crawler sketch after this list)
- A summary of common platforms. (We liked the inclusion of information about eBay store data.)
- General questions about the A1 software.
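For readers rolling their own downloader, the following minimal Python sketch illustrates two items from the list above: capping how many pages are grabbed and flagging “soft” 404s, where the server returns 200 OK wrapped around an error page. It is illustrative only, not the A1 product’s logic; the soft 404 marker phrases and the page cap are assumptions.

```python
import urllib.error
import urllib.parse
import urllib.request
from collections import deque
from html.parser import HTMLParser

MAX_PAGES = 50  # restrict how many pages are grabbed
SOFT_404_HINTS = ("page not found", "no longer available")  # assumed marker phrases

class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(href)

def crawl(start_url: str):
    seen, queue, grabbed = {start_url}, deque([start_url]), 0
    while queue and grabbed < MAX_PAGES:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as err:
            print(f"hard error {err.code}: {url}")  # e.g., a genuine 404 status
            continue
        if any(hint in html.lower() for hint in SOFT_404_HINTS):
            print(f"soft 404 suspected: {url}")  # 200 OK wrapped around an error page
            continue
        grabbed += 1
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urllib.parse.urljoin(url, href)
            if absolute.startswith(start_url) and absolute not in seen:
                seen.add(absolute)  # stay on the site being downloaded
                queue.append(absolute)
        yield url, html

# Usage: for url, html in crawl("https://example.com/"): save(url, html)
```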
You can access the software and the useful help information via the Microsys Web site at this link. Version 1.0 is free. The current version is about US$40.
DarkCyber pays some attention to software which purports to download Web sites. If you want to download Dark Web sites or content accessible via an obfuscation system, you will have to look elsewhere or do your own programming.
Stephen E Arnold, February 20, 2020
Endgame Now Runs On Elastic
August 1, 2019
The Marvel Cinematic Universe “ended” a few weeks ago with Avengers: Endgame, so any and all news with the keyword “endgame” was overshadowed by Disney’s superhero franchise. It goes without saying that Elastic’s acquisition of digital security company Endgame went unnoticed, but you can read about it here at Computer Weekly: “Robust For Your Pleasure: Elastic Acquires Endgame.”
Elasticsearch is a popular, open source search and analytics engine, and its parent company Elastic also created the Elastic Stack data analysis and visualization toolset. Purchasing Endgame was a practical decision for Elastic, because moving into security technology is the next logical step for a company that specializes in data search, analytics, and visualization. Endgame specializes in endpoint prevention, detection, and response. Elastic wants its Elastic Stack to be more secure, particularly for Security Information and Event Management (SIEM).
Elastic founder and CEO Shay Banon, Endgame CEO Nate Fick, and Endgame CTO Jamie Butler are excited about their team up. While it might not be as epic as a Guardians of the Galaxy and Avengers mashup, Elastic and Endgame are committing to transparency, user enablement, and openness as well as to gaining more customers.
Elastic and Endgame offer their customers some of the best technology for data analytics, management, and security:
“As the creators of the Elastic Stack (Elasticsearch, Kibana, Beats, and Logstash), Elastic builds self-managed and SaaS offerings that claim to make data usable in real time at scale for use cases like application search, site search, enterprise search, logging, APM, metrics, security and business analytics. Endgame makes endpoint protection using machine learning technology that is supposedly capable of stopping everything from ransomware, to phishing and targeted attacks. The company says its USP lies in its hybrid architecture that offers both cloud administration and data localization that meets industry, regulatory and global compliance requirements.”
Together, Elastic and Endgame will combine their powers for their customers’ benefit, possibly delivering technology to rival even S.H.I.E.L.D. or something Tony Stark could invent.
Whitney Grace, August 1, 2019
Quote to Note: Palantir Flaw
December 24, 2018
I read “Koverse Co-Founders Tap NSA Expertise to Build a Platform to Solve Unsolvable Tech Challenges.” Koverse is a big data company, based in Seattle. The firm’s engineers use the Apache Accumulo data management system. (Accumulo shares some DNA with the Google Bigtable data management system which is old enough to vote.)
Koverse’s competition includes Silicon Valley’s Palantir Technologies, a company worth billions that was started by PayPal co-founder Peter Thiel. One Koverse co-founder, Matsuo, downplayed Palantir’s hype. “They have gaping holes in their product that we are starting to exploit,” he said.
That is an interesting comment about Palantir Technologies, a company which has captured a number of commercial and government customers. With an initial public offering rumored, Palantir may find the observations a bit negative.
The company offers its Precision search engine. The write up points out that Koverse has “unparalleled” scalability and security.
For more information about the NSA infused Koverse, navigate to www.koverse.com.
Stephen E Arnold, December 24, 2018
Understanding Intention: Fluffy and Frothy with a Few Factoids Folded In
October 16, 2017
Introduction
One of my colleagues forwarded me a document called “Understanding Intention: Using Content, Context, and the Crowd to Build Better Search Applications.” To get a copy of the collateral, one has to register at this link. My colleague wanted to know what I thought about this “book” by Lucidworks. That’s what Lucidworks calls the 25-page marketing brochure. I read the PDF file and was surprised at what I perceived as fluff, not facts or a cohesive argument.
The topic was of interest to my colleague because we completed a five-month review and analysis of “intent” technology. In addition to two white papers about using smart software to figure out and tag (index) content, we had to immerse ourselves in computational linguistics, multi-language content processing technology, and semantic methods for “making sense” of text.
The Lucidworks’ document purported to explain intent in terms of content, context, and the crowd. The company explains:
With the challenges of scaling and storage ticked off the to-do list, what’s next for search in the enterprise? This ebook looks at the holy trinity of content, context, and crowd and how these three ingredients can drive a personalized, highly-relevant search experience for every user.
The presentation of “intent” was quite different from what I expected. The details of figuring out what content “means” were sparse. The focus was not on methodology but on selling integration services. I found this interesting because I have Lucidworks in my list of open source search vendors. These are companies which repackage open source technology, create some proprietary software, and assist organizations with engineering and integrating services.
The book was an explanation anchored in buzzwords, not the type of detail we expected. After reading the text, I was not sure how Lucidworks would go about figuring out what an utterance might mean. The intent-centric systems we reviewed over the course of five months followed several different paths.
Some companies relied upon statistical procedures. Others used dictionaries and pattern matching. A few combined multiple approaches in a content pipeline. Our client, a firm based in Madrid, focused on computational linguistics plus a series of procedures which combined proprietary methods with “modules” to perform specific functions. The idea of this approach was to raise intent identification accuracy from the 65 to 80 percent range to approaching, and often exceeding, 90 percent. For text processing in multi-language corpuses, the Spanish company’s approach was a breakthrough.
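To show what the dictionary and pattern matching camp looks like in practice, here is a toy sketch: each intent is a bag of trigger terms plus a few regular expressions, and an utterance is tagged with the best scoring intent. The intents and vocabulary are invented for illustration; this is neither the Madrid firm’s proprietary pipeline nor anything Lucidworks documents.

```python
import re

# Toy intent definitions: a dictionary of trigger terms plus regex patterns.
# Both the intents and the vocabulary are invented for illustration.
INTENTS = {
    "purchase": {"terms": {"buy", "order", "price", "cost"},
                 "patterns": [r"\bhow much\b"]},
    "support":  {"terms": {"broken", "error", "help", "crash"},
                 "patterns": [r"\bdoes(n't| not) work\b"]},
}

def classify_intent(utterance: str) -> str:
    """Score each intent by dictionary hits plus pattern matches;
    return the best scorer, or 'unknown' if nothing fires."""
    lowered = utterance.lower()
    tokens = set(re.findall(r"[a-z']+", lowered))
    best_intent, best_score = "unknown", 0
    for intent, spec in INTENTS.items():
        score = len(tokens & spec["terms"])
        score += sum(bool(re.search(p, lowered)) for p in spec["patterns"])
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent

print(classify_intent("How much does the premium plan cost?"))  # -> purchase
```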
I was disappointed but not surprised that Lucidworks’ approach was breezy. One of my colleagues used the word “frothy” to describe the information in the “Understanding Intention” document.
As I read the document, which struck me as a shotgun marriage of generalizations and examples of use cases in which “intent” was important, I made some notes.
Let me highlight five of the observations I made. I urge you to read the original Lucidworks’ document so you can judge the Lucidworks’ arguments for yourself.
Imitation without Attribution
My first reaction was that Lucidworks had borrowed conceptually from ideas articulated by Dr. Gregory Grefenstette in his book Search Based Applications: At the Confluence of Search and Database Technologies. You can purchase this 2011 book on Amazon at this link. Lucidworks’ approach, unlike Dr. Grefenstette’s, borrowed some of the analysis but did not include the detail which supports the increasing importance of using search as a utility within larger information access solutions. Without detail, the Lucidworks’ document struck me as a description of the type of solutions that a company like Tibco is now offering its customers.
Voice Search: An Amazon and Google Dust Up
January 26, 2017
I read “Amazon and Google Fight Crucial Battle over Voice Recognition.” I like the idea that Amazon and Google are macho brawlers. I think of folks who code as warriors. Hey, because some programmers wear their Comicon costumes to work in Mountain View and Seattle, some may believe that code masters are wimps. Obviously they are not. The voice focused programmers are tough, tough dudes and dudettes.
I learned from a “real” British newspaper that two Viking-inspired warrior cults are locked in a battle. The fate of the voice search world hangs in the balance. Why isn’t this dust up covered in more depth on Entertainment Tonight or the talking head “real” news television programs?
I learned:
The retail giant has a threatening lead over its rival with the Echo and Alexa, as questions remain over how the search engine can turn voice technology into revenue.
What? If there is a battle, it seems that Amazon has a “threatening lead.” How will Google respond? Online advertising? New products like the Pixel which, in some areas, is not available due to production and logistics issues?
No. Here’s the scoop from the Fleet Street experts:
The risk to Google is that at the moment, almost everyone starting a general search at home begins at Google’s home page on a PC or phone. That leads to a results page topped by text adverts – which help generate about 90% of Google’s revenue, and probably more of its profits. But if people begin searching or ordering goods via an Echo, bypassing Google, that ad revenue will fall. And Google has cause to be uncomfortable. The shift from desktop to mobile saw the average number of searches per person fall as people moved to dedicated apps; Google responded by adding more ads to both desktop and search pages, juicing revenues. A shift that cut out the desktop in favor of voice-oriented search, or no search at all, would imperil its lucrative revenue stream.
Do I detect a bit of glee in this passage? Google is responding in what is presented as a somewhat predictable way:
Google’s natural reaction is to have its own voice-driven home system, in Home. But that poses a difficulty, illustrated by the problems it claims to solve. At the device’s launch, one presenter from the company explained how it could speak the answer to questions such as “how do you get wine stains out of a rug?” Most people would pose that question on a PC or mobile, and the results page would offer a series of paid-for ads. On Home, you just get the answer – without ads.
Hasn’t Google read “The Art of War,” which advises:
“Let your plans be dark and impenetrable as night, and when you move, fall like a thunderbolt.”
My hunch is that this “real” news write up is designed to poke the soft underbelly of Googzilla. That sounds like a great idea. Try this with your Alexa, “Alexa, how do I hassle Google?”
Stephen E Arnold, January 26, 2017
Some Things Change, Others Do Not: Google and Content
January 20, 2017
Search Engine Journal’s “The Evolution of Semantic Search and Why Content Is Still King” brings to mind how RankBrain is changing the way Google ranks search results. The article was written in 2014, but it stresses the importance of semantic search and SEO. With RankBrain, semantic search is a daily occurrence rather than something to strive for.
RankBrain also demonstrates how far search technology has come in three years. When people search, they no longer want to fish out the keywords from their query; instead they enter an entire question and expect the search engine to understand.
This brings up the question: is content still king? Back in 2014, the answer was yes and the answer is a giant YES now. With RankBrain learning the context behind queries, well-written content is what will drive search engine ranking:
What it boils to is search engines and their complex algorithms are trying to recognize quality over fluff. Sure, search engine optimization will make you more visible, but content is what will keep people coming back for more. You can safely say content will become a company asset because a company’s primary goal is to give value to their audience.
The article ends with something about natural language and how people want their content to reflect it. The article does not provide anything new, but does restate the value of content over fluff. What will happen when computers learn how to create semantic content, however?
Whitney Grace, January 20, 2017
Indexing: A Cautionary Example
November 17, 2015
I read “Half of World’s Museum Specimens Are Wrongly Labeled, Oxford University Finds.” Anyone involved in indexing knows the perils of assigning labels, tags, or what the whiz kids call metadata to an object.
Humans make mistakes. According to the write up:
As many as half of all natural history specimens held in some of the world’s greatest institutions are probably wrongly labeled, according to experts at Oxford University and the Royal Botanic Garden in Edinburgh. The confusion has arisen because even accomplished naturalists struggle to tell the difference between similar plants and insects. And with hundreds or thousands of specimens arriving at once, it can be too time-consuming to meticulously research each and guesses have to be made.
Yikes. Only half. I know that human indexers get tired. Now there is just too much work to do. The reaction is typical of busy subject matter experts. Just guess. Close enough for horse shoes.
What about machine indexing? Anyone who has retrained an HP Autonomy system knows that humans get involved as well. If humans make mistakes with bugs and weeds, imagine what happens when a human has to figure out a blog post in a dialect of Korean.
The brutal reality is that indexing is a problem. When dealing with humans, the problems do not go away. When humans interact with automated systems, the automated systems make mistakes, often more rapidly than the sorry human indexing professionals do.
What’s the point?
I would sum up the implication as:
Do not believe a human (indexing species or marketer of automated indexing species).
Acceptable indexing with accuracy above 85 percent is very difficult to achieve. Unfortunately the graduates of a taxonomy boot camp or the entrepreneur flogging an automatic indexing system which is powered by artificial intelligence may not be reliable sources of information.
I know that this notion of high error rates is disappointing to those who believe their whizzy new system works like a champ.
Reality is often painful, particularly when indexing is involved.
What are the consequences? Here are three:
- Results of queries are incomplete or just wrong (see the arithmetic sketch after this list)
- Users are unaware of missing information
- Failure to maintain either human, human assisted, or automated systems results in indexing drift. Eventually the indexing is just misleading if not incorrect.
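A back-of-the-envelope calculation shows how the first two consequences follow from imperfect tagging. Assume each tag is assigned correctly with probability 0.85 and that errors are independent (a simplification of mine, not a finding from the Oxford study); a query that must match several tags fails more often than any single tag does:

```python
# Probability a record is retrievable by a query that must match k tags,
# when each tag is correct with probability p. Independence is assumed
# for illustration; real indexing errors are often correlated.
p = 0.85
for k in (1, 2, 3):
    print(f"{k} tag(s): {p ** k:.1%} of records correctly retrievable")
# 1 tag(s): 85.0%   2 tag(s): 72.2%   3 tag(s): 61.4%
```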
How accurate is your firm’s indexing? How accurate is your own indexing?
Stephen E Arnold, November 17, 2015
Exclusive Interview: Danny Rogers, Terbium Labs
August 11, 2015
Editor’s note: The full text of the exclusive interview with Dr. Daniel J. Rogers, co-founder of Terbium Labs, is available on the Xenky Cyberwizards Speak Web service at www.xenky.com/terbium-labs. The interview was conducted on August 4, 2015.
Significant innovations in information access, despite the hyperbole of marketing and sales professionals, are relatively infrequent. In an exclusive interview, Danny Rogers, one of the founders of Terbium Labs, explains how his company flips on the lights to make it easy to locate information hidden on the Dark Web.
Web search has been a one-trick pony since the days of Excite, HotBot, and Lycos. For most people, a mobile device takes cues from the user’s location and click streams and displays answers. Access to digital information requires more than parlor tricks and pay-to-play advertising. A handful of companies are moving beyond commoditized search, and they are opening important new markets such as detecting the theft of secret and high value data. Terbium Labs can “illuminate the Dark Web.”
In an exclusive interview, Dr. Danny Rogers, who founded Terbium Labs with Michael Moore, explained the company’s ability to change how data breaches are located. He said:
Typically, breaches are discovered by third parties such as journalists or law enforcement. In fact, according to Verizon’s 2014 Data Breach Investigations Report, that was the case in 85% of data breaches. Furthermore, discovery, because it is by accident, often takes months, or may not happen at all when limited personnel resources are already heavily taxed. Estimates put the average breach discovery time between 200 and 230 days, an exceedingly long time for an organization’s data to be out of their control. We hope to change that. By using Matchlight, we bring the breach discovery time down to between 30 seconds and 15 minutes from the time stolen data is posted to the web, alerting our clients immediately and automatically. By dramatically reducing the breach discovery time and bringing that discovery into the organization, we’re able to reduce damages and open up more effective remediation options.
Terbium’s approach, it turns out, can be applied to traditional research into content domains to which most systems are effectively blind. At this time, a very small number of companies are able to index content that is not available to traditional content processing systems. Terbium acquires content from Web sites which require specialized software to access. Terbium’s system then processes the content, converting it into the equivalent of an old-fashioned fingerprint. Real-time pattern matching makes it possible for the company’s system to locate a client’s content, either in textual form, software binaries, or other digital representations.
One of the most significant information access innovations uses systems and methods developed by physicists to deal with the flood of data resulting from research into the behaviors of difficult-to-differentiate subatomic particles.
One part of the process is for Terbium to acquire (crawl) content and convert it into encrypted 14 byte strings of zeros and ones. A client such as a bank then uses the Terbium content encryption and conversion process to produce representations of the confidential data, computer code, or other data. Terbium’s system, in effect, looks for matching digital fingerprints. The task of locating confidential or proprietary data via traditional means is expensive and often a hit and miss affair.
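Terbium’s actual protocol is patented and its details are proprietary, but the general fingerprint-and-match idea can be sketched. The fragment below slides a word window over a document, computes a keyed hash of each window, truncates it to 14 bytes, and looks for overlaps between a client’s fingerprints and fingerprints of crawled text. The window size, the HMAC construction, and the key handling are my assumptions for illustration, not Matchlight’s design.

```python
import hashlib
import hmac

SHINGLE_WORDS = 8           # assumed window size, not Terbium's parameter
SECRET_KEY = b"client-key"  # keyed so the indexer never sees plaintext

def fingerprints(text: str) -> set[bytes]:
    """Slide a word window over the text; HMAC each window and
    keep the first 14 bytes as the fingerprint."""
    words = text.lower().split()
    prints = set()
    for i in range(max(len(words) - SHINGLE_WORDS + 1, 1)):
        window = " ".join(words[i:i + SHINGLE_WORDS]).encode()
        prints.add(hmac.new(SECRET_KEY, window, hashlib.sha256).digest()[:14])
    return prints

# The client fingerprints its confidential record; the monitor fingerprints
# crawled pages and looks for intersections -- neither side exchanges text.
client_prints = fingerprints("account 1234 belongs to jane q public ...")
crawled_prints = fingerprints("dump: account 1234 belongs to jane q public ...")
print(len(client_prints & crawled_prints), "matching fingerprints")
```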
Terbium Labs changes the rules of the game and in the process has created a way to provide its licensees with anti-fraud and anti-theft measures which are unique. In addition, Terbium’s digital fingerprints make it possible to find, analyze, and make sense of digital information not previously available. The system has applications for the Clear Web, which millions of people access every minute, to the hidden content residing on the so called Dark Web.
Terbium Labs, a start up located in Baltimore, Maryland, has developed technology that makes use of advanced mathematics—what I call numerical recipes—to perform analyses for the purpose of finding connections. The firm’s approach is one that deals with strings of zeros and ones, not the actual words and numbers in a stream of information. By matching these numerical tokens with content such as a data file of classified documents or a record of bank account numbers, Terbium does what strikes many, including myself, as a remarkable achievement.
Terbium’s technology can identify highly probable instances of improper use of classified or confidential information. Terbium can pinpoint where the compromised data reside on either the Clear Web, another network, or the Dark Web. Terbium then alerts the organization about the compromised data and works with the victim of Internet fraud to resolve the matter in a satisfactory manner.
Terbium’s breakthrough has attracted considerable attention in the cyber security sector, and applications of the firm’s approach are beginning to surface for disciplines from competitive intelligence to health care.
Rogers explained:
We spent a significant amount of time working on both the private data fingerprinting protocol and the infrastructure required to privately index the dark web. We pull in billions of hashes daily, and the systems and technology required to do that in a stable and efficient way are extremely difficult to build. Right now we have over a quarter trillion data fingerprints in our index, and that number is growing by the billions every day.
The idea for the company emerged from a conversation with a colleague who wanted to find out immediately if a high profile client list was ever leaked to the Internet. But, said Rogers, “This individual could not reveal to Terbium the list itself.”
How can an organization locate secret information if that information cannot be provided to a system able to search for the confidential information?
The solution Terbium’s founders developed relies on novel use of encryption techniques, tokenization, Clear and Dark Web content acquisition and processing, and real time pattern matching methods. The interlocking innovations have been patented (US8,997,256), and Terbium is one of the few, perhaps the only company in the world, able to crack open Dark Web content within regulatory and national security constraints.
Rogers said:
I think I have to say that the adversaries are winning right now. Despite billions being spent on information security, breaches are happening every single day. Currently, the best the industry can do is be reactive. The adversaries have the perpetual advantage of surprise and are constantly coming up with new ways to gain access to sensitive data. Additionally, the legal system has a long way to go to catch up with technology. It really is a free-for-all out there, which limits the ability of governments to respond. So right now, the attackers seem to be winning, though we see Terbium and Matchlight as part of the response that turns that tide.
Terbium’s product is Matchlight. According to Rogers:
Matchlight is the world’s first truly private, truly automated data intelligence system. It uses our data fingerprinting technology to build and maintain a private index of the dark web and other sites where stolen information is most often leaked or traded. While the space on the internet that traffics in that sort of activity isn’t intractably large, it’s certainly larger than any human analyst can keep up with. We use large-scale automation and big data technologies to provide early indicators of breach in order to make those analysts’ jobs more efficient. We also employ a unique data fingerprinting technology that allows us to monitor our clients’ information without ever having to see or store their originating data, meaning we don’t increase their attack surface and they don’t have to trust us with their information.
For more information about Terbium, navigate to the company’s Web site. The full text of the interview appears on Stephen E Arnold’s Xenky cyberOSINT Web site at http://bit.ly/1TaiSVN.
Stephen E Arnold, August 11, 2015
Watson: Following in the Footsteps of America Online with PR, not CD ROMs
July 31, 2015
I am now getting interested in the marketing efforts of IBM Watson’s professionals. I have written about some of the items which my Overflight system snags.
I have gathered a handful of gems from the past week or so. As you peruse these items, remember several facts:
- Watson is Lucene, home brew scripts, and acquired search utilities like Vivisimo’s clustering and de-duplicating technology
- IBM said that Watson would be a multi billion dollar business and then dropped that target from 10 or 12 Autonomy scale operations to something more modest. How modest the company won’t say.
- IBM has tallied a baker’s dozen of quarterly reports with declining revenues
- IBM’s reallocation of employee resources continues as IBM is starting to run out of easy ways to trim expenses
- The good old mainframe is still a technology wonder, and it produces something Watson only dreams about: Profits.
Here we go. Remember high school English class and the “willing suspension of disbelief.” Keep that in mind, please.
ITEM 1: “IBM Watson to Help Cities Run Smarter.” The main assertion, which comes from unicorn land, is: “Purple Forge’s “Powered by IBM Watson” solution uses Watson’s question answering and natural language processing capabilities to let users ask questions and get evidence-based answers using a website, smartphone or wearable devices such as the Apple Watch, without having to wait for a call agent or a reply to an email.” There you go. Better customer service. Aren’t governments supposed to serve their citizens? Does the project suggest that city governments are not performing this basic duty? Smarter? Hmm.
ITEM 2: “Why I’m So Excited about Watson, IBM’s Answer Man.” In this remarkable essay, an “expert” explains that the president of IBM explained to a TV interviewer that IBM was being “reinvented.” Here’s the quote that I found amusing: “IBM invented almost everything about data,” Rometty insisted. “Our research lab was the first one ever in Silicon Valley. Creating Watson made perfect sense for us. Now he’s ready to help everyone.” Now the author is probably unaware that I was, lo, these many years ago, involved with an IBMer, Herb Noble, who was struggling to make IBM’s own and much loved STAIRS III work. I wish to point out that Silicon Valley research did not have its hands on the steering wheel when it came to the STAIRS system. In fact, the job of making this puppy work fell to IBM folks in Germany as I recall.
ITEM 3: “IBM Watson, CVS Deal: How the Smartest Computer on Earth Could Shake Up Health Care for 70m Pharmacy Customers.” Now this is an astounding chunk of public relations output. I am confident that the author is confident that “real journalism” was involved. You know: Interviewing, researching, analyzing, using Watson, talking to customers, etc. Here’s the passage I highlighted: “One of the most frustrating things for patients can be a lack of access to their health or prescription history and the ability to share it. This is one of the things both IBM and CVS officials have said they hope to solve.” Yes, hope. It springs eternal as my mother used to say.
If you find these fact filled romps through the market activating technology of Watson convincing, you may be qualified to become a Watson believer. For me, I am reminded of Charles Bukowski’s alleged quip:
The problem with the world is that the intelligent people are full of doubts while the stupid ones are full of confidence.
Stephen E Arnold, July 31, 2015