Microsoft and Its Latest Search Innovation: Moving Past Fast? Nope
May 22, 2020
I read “Microsoft Search: Search Your Document Like You Search the Web.” Perhaps Microsoft did not get the reports about the demise of the Google Search Appliance. That “invention” made clear that searching a corporate content collection like you search the Web was not exactly the greatest thing since sliced bread. There were a number of reasons for the failure of the GSA. It was a black box. You know that mere mortals could not tune the relevance component. You know that it produced results that left employees wondering, “Where is the document I wrote yesterday?” You know that the corpus of Web content is different from the fruit cake of corporate content. Web search returns something because the system is rigged to find a way to display ads to the hapless searcher.
Contrast this with documents in the cloud, in different systems like that old AS/400 Ironsides application used by the warehouse supervisors, and content tucked away on employees’ USB drives, mobile phones, the oldest kid’s iPad, and on services a go-to sales professional uses to store PowerPoints for “special” customers. Then there are the documents in the corporate legal office. And the consultants’ reports scanned and stored on the Marketing Department’s computer kept for interns.
Nevertheless, the article explains:
We’re utilizing well-established web search technologies, such as query and document understanding, and adding deep learning based natural language models. This allows us to handle a much broader set of search queries beyond “exact match.”
Okay, query expansion, synonym look up, and Fast Search’s concept feature. But there’s more:
With the recent breakthroughs in deep learning techniques, you can now go beyond the common search term-based queries. The result is answers to your questions based on the document content. This opens a whole new way of finding knowledge. When you’re looking at a water quality report, you can answer questions like “where does the city water originate from? How to reduce the amount of lead in water?”
May I suggest that Microsoft and dozens of other enterprise search vendors have promised magical retrieval?
May I point out that the following content types are usually outside the ken of the latest and greatest enterprise search confection; for example:
- Quality control data on parts stored in an Autodesk engineering document
- Real time data flowing into an organization from sensors
- Video content, audio content, and rich media like photographs
- Classified content or content restricted by certain constraints. (Access controls are often best implemented by specialized systems unknown to the greedy enterprise search indexing system.)
- Documents obtained through an eDiscovery process for legal matters.
Has Microsoft solved these problems? Sure, if everything (note the logically impossible categorical affirmative) is in an Azure repository, it is conceivable that a user query could return a particular content object.
But that’s Microsoft fantasy land, and it is about as likely as Mr. Nadella arriving at work on the back of a unicorn.
Microsoft feels compelled to reinvent search every year or two. The longest journey begins with a single step. It is just that Microsoft took those steps decades ago and still has not reached the now rubblized Fred Harvey’s.
Stephen E Arnold, May 22, 2020
Lucidworks: Buzzwording in the Pandemic
May 19, 2020
Lucid Imagination (the outfit which contributed some Lucene/Solr talent to Amazon search) renamed itself Lucidworks. The company then embarked on becoming a West Coast version of Fast Search & Transfer, a Splunk like outfit, and now a customer support provider.
That’s a remarkable trajectory for a company built on open source software with more than $200 million in funding since 2007.
One of the DarkCyber researchers spotted “Lucidworks Develops Deep Learning Solution to Make Chatbots Smarter.” The story appeared in a New Zealand online publication. That’s interesting, but more intriguing is that Lucidworks is following in the marketing footsteps of Attivio, Coveo, and other vendors of search and retrieval. The destination: customer service. Who doesn’t love automated customer support chat robots, self serve Web sites with smart software, and the general extinction of individuals who actually know a company’s software or hardware products?
The write up states:
Deep learning is essential for automated chatbots to understand natural language questions and to provide the right answers, which is something that AI-powered search firm Lucidworks has taken on board.
And why?
According to Lucidworks, companies rely on digital portals to provide information to users, whether digital commerce customers looking for product information before purchase, employees hunting for an HR document, or someone looking for an airline’s updated cancellation policies. Information is often scattered across disparate silos and is impossible for a user to locate using natural language questions.
But smart software is available from Amazon with a credit card and some free training courses. Outfits from Algolia to Voyager Search offer the service.
What is interesting is the buzzword salad tossed into this reheated plastic container of mapo tofu:
- AI (artificial intelligence)
- Automated
- Chatbots
- Conversational
- Deep learning
- Digital portals
- Engagement
- Experiences
- Fusion
- Natural language
- Satisfaction
- User intent
- Virtual assistants
Quite a vocabulary and what seems an exercise in content marketing. Plus, eager customers in New Zealand will have an opportunity to help the company repay its investors the $200 million plus interest. That works out to 13 years in the enterprise search wilderness before arriving at chatbots.
Options abound and many of them are open source and well documented.
Stephen E Arnold, May 19, 2020
Boolean Is Better but Maybe Google Must Motor Through Ad Inventory by Relaxing Queries…a Lot?
May 17, 2020
A brief exchange on StackExchange demonstrates some common sense. One user, moseisley.2015, asks the community, “Should Default Search Behavior be ‘This AND That,’ or ‘This OR That’?” They elaborate:
“I have web application that shows lists of various data types … employees, customers, inventory items, orders, and so on. There’s one simple search field for doing a ‘global’ search … . Question is, when a user enters multi-word text in the field should the default search behavior be (1) this OR that or (2) this AND that? What default behavior do you think average users would expect?”
Their example lists four records: John Smith, John Jones, Michael Smith, and Betty Taylor-Smith. Would users expect the query “John Smith” to return just the first record (AND) or all four (OR)? As any online researcher from the ‘70s and ‘80s would tell you, the Boolean AND is the better default. The first respondent, SNag, sensibly writes:
“As a user, the more I type in, the more specific I’m expecting the results to get, and this is what happens with AND. With OR, your results would explode! If my search for popular Google Doodle games gave me everything that was popular, everything Google, everything Doodle and every game out there, I’d be lost! If you’re expecting your user to fetch all matching either John or Smith results, consider supporting syntax like John|Smith (where | is the logical symbol for OR) and placing a hint ? icon next to the search box to showcase the various supported syntaxes. You could also consider quotes in the search syntax for exact matches, where “Smith” wouldn’t match Taylor-Smith, but Smith would. “John”|”Smith” would then match all John and all Smith but not Betty Taylor-Smith.”
We concur. The second respondent, Big_Chair, adds a good observation—users without any programming background are probably unfamiliar with the | character and may need a more explicit cue that their query is about to return results based on OR rather than AND.
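To make the AND-versus-OR difference concrete, here is a minimal Python sketch run against the four records from the question. The `search` function, its `default_op` parameter, and the tokenizing rule (split on anything that is not a letter or digit, so “Taylor-Smith” yields both “taylor” and “smith”) are illustrative assumptions, not anything proposed in the thread. The sketch treats any query containing `|` as an all-OR query and does not implement the quoted exact-match syntax SNag describes.

```python
import re

# The four example records from the StackExchange question.
RECORDS = ["John Smith", "John Jones", "Michael Smith", "Betty Taylor-Smith"]


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(query, records, default_op="AND"):
    """Return the records that match the query.

    Whitespace-separated terms are combined with default_op.
    A '|' anywhere in the query switches the whole query to OR
    (a simplification; a real parser would handle mixed queries).
    """
    if "|" in query:
        terms = [t for part in query.split("|") for t in tokenize(part)]
        op = "OR"
    else:
        terms = list(tokenize(query))
        op = default_op
    match = all if op == "AND" else any
    return [
        r for r in records
        if terms and match(t in tokenize(r) for t in terms)
    ]
```

Under these assumptions, `search("John Smith", RECORDS)` returns only the John Smith record, while `search("John Smith", RECORDS, default_op="OR")` or `search("John|Smith", RECORDS)` returns all four. That is the explosion of results SNag warns about.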
Cynthia Murrell, May 17, 2020
Google: Regular Search Not Up to Covid19 Queries. Who Knew?
May 15, 2020
Google has launched a new semantic search tool designed to help researchers fight this pandemic. The Google AI Blog reveals “An NLU-Powered Tool to Explore COVID-19 Scientific Literature.” As one might expect, researchers around the world have been turning out an enormous number of papers on the disease and how we might fight it. Why does this call for a special tool? Google researcher Keith Hall writes:
“Traditional search engines can be excellent resources for finding real-time information on general COVID-19 questions like ‘How many COVID-19 cases are there in the United States?’, but can struggle with understanding the meaning behind research-driven queries. Furthermore, searching through the existing corpus of COVID-19 scientific literature with traditional keyword-based approaches can make it difficult to pinpoint relevant evidence for complex queries. To help address this problem, we are launching the COVID-19 Research Explorer, a semantic search interface on top of the COVID-19 Open Research Dataset (CORD-19), which includes more than 50,000 journal articles and preprints.”
Based on the BERT technology recently injected into the general Google Search, this bespoke semantic AI has been trained on biomedical literature. The team chose to build a hybrid term-neural retrieval model for this platform—a combination of keyword search and neural retrieval; see the article for the technical details. Here’s how the search functions:
“When the user asks an initial question, the tool not only returns a set of papers (like in a traditional search) but also highlights snippets from the paper that are possible answers to the question. The user can review the snippets and quickly make a decision on whether or not that paper is worth further reading. If the user is satisfied with the initial set of papers and snippets, we have added functionality to pose follow-up questions, which act as new queries for the original set of retrieved articles.”
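For readers wondering what a “hybrid term-neural” ranker might look like in miniature, the following Python sketch blends a keyword-overlap score with a cosine similarity between vectors. It is not Google’s implementation. The function names, the `alpha` weight, and the hand-made two-dimensional vectors standing in for BERT-style embeddings are all assumptions made for illustration.

```python
import math


def keyword_score(query, doc):
    """Fraction of query terms that appear verbatim in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_rank(query, query_vec, docs, doc_vecs, alpha=0.5):
    """Rank docs by alpha * keyword score + (1 - alpha) * vector similarity."""
    scored = []
    for doc, vec in zip(docs, doc_vecs):
        score = alpha * keyword_score(query, doc) + (1 - alpha) * cosine(query_vec, vec)
        scored.append((score, doc))
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked]


# Toy data: the vectors are invented stand-ins for learned embeddings.
DOCS = ["masks reduce transmission", "vaccine trial results"]
DOC_VECS = [[0.9, 0.1], [0.1, 0.9]]
QUERY = "mask efficacy"
QUERY_VEC = [1.0, 0.0]
```

In this toy setup neither document contains the exact terms “mask” or “efficacy,” so the keyword score is zero for both. The vector similarity alone puts “masks reduce transmission” first, which is the gap the neural half of a hybrid model is meant to cover.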
The open-alpha platform is available for free to the research community, and Google plans to continue refining the system over the next few months. May this tool help scientists find solutions that much faster.
Cynthia Murrell, May 15, 2020
Deindexing: Does It Officially Exist?
May 14, 2020
DarkCyber noted “LinkedIn Temporarily Deindexed from Google.” The rock solid, hard news service stated:
LinkedIn found itself deindexed from Google search results on Wednesday, which may or may not have occurred due to an error on their part. The telltale sign of an entire domain being deindexed from Google is performing a “site:” search and seeing zero results.
Mysterious.
DarkCyber has fielded two reports of deindexing from Google in the last three days. In one case a site providing automobile data was disappeared. In another, a site focused on the politics of the intelligence sector was pushed from page one to the depths of page three.
Why?
No explanation, of course.
LinkedIn is owned by Microsoft. Is that a reason? Did LinkedIn’s engineers ignore a warning about a problem in AMP?
Google does not make errors. If a problem arises, the cause is the vaunted Google smart software.
DarkCyber’s view is that Google is taking stepped up action to filter certain types of content. We have documented that one Google office has access to controls that can selectively block certain content from appearing in the public facing Web search system. The content is indeed indexed and available to those with certain types of access.
What’s up? Here are our theories:
- Google is trying to deal with problematic content in a more timely manner by relaxing constraints on search engineers working in Google “virtual offices” around the world. Human judgments will affect some Web sites. (Contacting Google is as difficult as it has been for the last 20 years.)
- Google wants to make sure that ads do not appear next to content that might cause a big spender to pull away. Google needs the cash. The thought is that Amazon and Facebook are starting to put a shunt in the money pipeline.
- Google is struggling to control costs. Slowing indexing, removing sites from a crawl, and pushing content that is rarely viewed to the side of the Information Superhighway reduce some of the costs associated with serving more than 95 percent of the queries launched by humans each day.
Regardless of the real reason or the theoretical ones, Google’s control over findable content can have interesting consequences. For example, more investigations are ramping up in Europe about the firm’s practices (either human or software centric).
Interesting. Too bad others affected by Google actions are not of the girth and heft of LinkedIn. Oh, well, the one percent are at the top for a reason.
Stephen E Arnold, May 14, 2020
New Arnold-Steele Discussion: Findability Is Terrible
May 7, 2020
Robert David Steele, a former CIA professional, stored a video of our recent discussion about finding open source information. The main point is that findability has degraded to the point that results are generally useless. Bing, Google, and other ad-supported systems have abandoned precision and relevance. Search results are a dog’s breakfast. To view the findability discussion, navigate to this link. The video was produced by Mr. Steele.
Stephen E Arnold, May 7, 2020
Search Engine Optimization: The Next Frontier Is Smart SEO
April 29, 2020
Content strategy plans are the most overlooked part of any Web site design and advertising campaign. Good content is integral to selling a product or a service, but not everyone is good at creating it. News Patrolling runs down the “Best AI Tools For Content Marketing Strategy” and how AI is becoming an industry game changer.
Content is usually the first impression consumers have of companies. It is meant to engage the consumer, then:
“It serves as a tool to communicate with your audience. If you identify their pain points to provide them with a solution, they will trust you and be more interested in buying your offerings. The growth of your business depends on content strategy. It must be as effective as possible if you do not go downhill. Artificial intelligence can help you make an effective content marketing strategy. There are various tools to help you from targeting keywords to choosing the right topic. You will be surprised to know that AI tools can create a smarter content strategy by identifying the behaviour of users. Such software can help you increase revenues and reduce cost.”
The article recommends four content marketing tools: Hubpost, Quill, Clearscope, and BrightEdge. Hubpost is advertised as using machine learning to help one get an edge on competition. The software analyzes keywords to discover what consumers want, then it clusters topics based on competition level.
Quill specializes in keyword optimization and generating quality content. Clearscope also optimizes content using keywords. It helps you generate keywords based on Google data and select the best keywords to use. Once you choose a keyword and write your post, Clearscope compares the post with other top-ranking posts.
BrightEdge is one integrated software solution that provides performance measurement, optimization, and keywords. It is described as a one-size-fits-all for content marketing strategies.
AI can provide insights into how to create the best content, but the most important part of a content strategy plan remains creative humans.
Yep, SEO is modernizing and automating methods to ensure that ad-supported Web search engines decide what matches a query. Precision, recall, and objectivity? Forget those irrelevant concepts.
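For anyone who has forgotten what those “irrelevant concepts” measure: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved. A minimal Python sketch follows. The function name and the sample document IDs in the usage note are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

If a system returns four documents `["a", "b", "c", "d"]` and the relevant set is `["a", "b", "e"]`, two of the four results are relevant (precision 0.5) and two of the three relevant documents were found (recall of two thirds).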
Whitney Grace, April 29, 2020
Dig.ccMixter for Royalty-Free Tunes
April 22, 2020
Here is a resource that makers (and aspiring makers) of video content and games will want to bookmark. CCMixter is an online community where musicians share their work through creative commons licenses. Dig.ccMixter is our search portal into that content, free to download and use even for commercial purposes. Scrolling down reveals three categories: instrumental music for film & video; free music for commercial projects; and music for video games. Clicking the “Dig!” button leads to a keyword search page, where you can search by attributes like genre, mood, and instruments. The site’s About page, titled Yea, But Is It Legal? explains:
“This is a community music remixing site featuring remixes and samples licensed under Creative Commons licenses. Music on this site is licensed under a Creative Commons license. You are free to download and sample from music on this site and share the results with anyone, anywhere, anytime. Some songs might have certain restrictions, depending on their specific licenses. Each submission is marked clearly with the license that applies to it.”
So there you have it—a free source of music for your projects, even ones you intend to profit from. All you have to do is give credit where credit is due.
Interestingly, developers can also access the site’s ccHost Query API. We’re told:
“The ccHost Query API is an open, publicly available interface that is available for public use, especially by 3rd party websites, mobile applications, smart TV appliances and any other network connected device. We here at ccMixter use it to help expose the artists that upload their Creative Commons licensed music to audiences that otherwise would not have access to. The API and software implementation is owned by ArtIsTech Media under a license agreement with Creative Commons. The music itself is owned by the individual artists that uploaded it to the site and agree, through the Creative Commons licenses to share the music through this mechanism.”
Bing, Google, and Yandex are not suited for some types of music search. Enter Dig.ccMixter. Applause, please.
Cynthia Murrell, April 22, 2020
Video Search: Maybe Find That for Which You Were Looking? Ha Ha
April 9, 2020
Searching for a motion picture online? It is collective intelligence to the rescue at Ask MetaFilter’s thread, “How to Find What Streaming Services Certain Films Are On?” Canadian poster NoneOfTheAbove was perusing this 1000 Greatest Films list and asked for an easy way to locate specific films on streaming services across the web.
The obvious is stated—use Google—with the caveat that those results may not tell you if a membership is required. Another suggestion is to follow links in the movie’s IMDb description, and one respondent notes that if one already has Roku, its search results point to sources available through that subscription. A couple of people point to the streaming-service consolidator JustWatch, and one suggests Reelgood as a similar platform. The most descriptive answers, though, discuss Letterboxd:
“Another option is to sign up for a free membership with Letterboxd – that is a social-media movie-logging site that is really [darn] comprehensive. You can track what movies you want to see, what movies you have seen, and make endless lists of all kinds (‘Movies with female leads,’ ‘Movies with cute dogs,’ ‘Movies with Left-Handed Protaganists,’ whatever you want). A lot of members already have their own lists tracking their progress through the 1000 Greatest Movies list. Best of all – Letterboxd links to JustWatch and you can look at the streaming availability for a given movie when you pull it up on Letterboxd. So it may be fun to sign up for Letterboxd, make your own copy of the 1000 list, and then track your viewing progress. …Letterboxd also has a paid ‘Pro’ account where you can filter such a list based on a given streaming service like Netflix, but you may find that that’s overkill.” posted by EmpressCallipygos at 11:45 AM on March 31 [1 favorite]
“Bonus of having your own Letterboxd account is that you can already mark the ones you’ve seen and quickly visually scan for the ones you haven’t seen yet, then click through per film to see on which streaming services it’s available. I’ve been going through a bunch of the Criterion Collection this way recently myself. :D” posted by rather be jorting at 12:23 PM on March 31 [2 favorites]
So there you have several options supplied by the hive mind. Even if you aren’t looking for a film right now, this list may be worth bookmarking for future reference. Finding videos remains a challenge. Search has been solved, right? Yeah, sure.
Cynthia Murrell, April 9, 2020
Hyland Updates Document Processing Platform
April 8, 2020
Remember ISYS, the Australian search system? DarkCyber does. Hyland owns the technology. In a series of updates over the last six months, content-services provider Hyland Software has added file formats, capabilities, and support to its Document Filters platform, we learn from the press release posted by ProgrammableWeb, “Hyland Document Processing Update Includes New APIs.” The company aims to provide tools that allow its clients to process any type of file an organization may encounter in a typical day. Over 550 file formats are now supported. The write-up lists the new features:
- Text and metadata support for Apple iBook file types, Apple PList binary files, EPUB ebook file types, and Quattro Pro Spreadsheet files
- High definition support for NCR images, MS Project Gantt Charts, Microsoft Windows Clipboard (CLP) files, Microsoft Outlook for Mac OLK15MsgSource files, Paint Shop Pro images, Windows Cursor images, X-Windows-Bitmap images, X-Windows-Pixmap images, and WordPerfect Graphics (version 1)
- New API for extraction and processing of hierarchical bookmark information
- New API for the extraction and processing of static PDF form data
- Added option, DETECT_MACROS, that outputs a metadata value if macros are detected in MS Office documents
- New API to allow for adding common annotations such as notes, lines, shapes, polygons, and stamps. When added to PDF output, annotations are created as native PDF annotations that a user can interact with and modify
- New API to allow the control of graphic effects on a per page basis
- New option, GRAPHIC_ROTATE, to allow the rotation of an entire document rendition, or individual pages via the new graphic effects API
- Added support for mark-up and drawing functions onto an HTML5 canvas
With clients in several different industries, Hyland helps them leverage their data to better serve their own customers. It boasts that over half of 2019’s Fortune 100 companies use its products. Founded in 1991, the firm is based in Westlake, Ohio. How many years has ISYS been available? Good question, and DarkCyber knows the answer. If you said a number less than 30, you might be on a walkabout.
Cynthia Murrell, April 8, 2020