An Exploration of Search Code

April 9, 2021

Software engineer Bard de Geode posts an exercise in search coding on his blog—“Building a Full-Text Search Engine in 150 Lines of Python Code.” He has pared down the thousands and thousands of lines of code found in proprietary search systems to the essentials. Of course, those platforms have many more bells and whistles, but this gives one an idea of the basic components. Navigate to the write-up for the technical details and code snippets that I do not pretend to follow completely. The headings de Geode walks us through include Data, Data preparation, Indexing, Analysis, Indexing the corpus, Searching, Relevancy, Term frequency, and Inverse document frequency. He concludes:

“You can find all the code on Github, and I’ve provided a utility function that will download the Wikipedia abstracts and build an index. Install the requirements, run it in your Python console of choice and have fun messing with the data structures and searching. Now, obviously this is a project to illustrate the concepts of search and how it can be so fast (even with ranking, I can search and rank 6.27m documents on my laptop with a ‘slow’ language like Python) and not production grade software. It runs entirely in memory on my laptop, whereas libraries like Lucene utilize hyper-efficient data structures and even optimize disk seeks, and software like Elasticsearch and Solr scale Lucene to hundreds if not thousands of machines. That doesn’t mean that we can’t think about fun expansions on this basic functionality though; for example, we assume that every field in the document has the same contribution to relevancy, whereas a query term match in the title should probably be weighted more strongly than a match in the description. Another fun project could be to expand the query parsing; there’s no reason why either all or just one term need to match.”

For more information, de Geode recommends curious readers navigate to MonkeyLearn’s post “What is TF-IDF?” and to an explanation of “Term Frequency and Weighting” posted by Stanford’s NLP Group. Happy coding.
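For readers who want a feel for the moving parts named in the headings (analysis, indexing, searching, term frequency, and inverse document frequency), here is a minimal sketch. It is not de Geode’s code: the `TinyIndex` class, the `analyze` tokenizer, and the sample documents are hypothetical, and the scoring is plain term frequency multiplied by a log inverse document frequency.

```python
import math
import re
from collections import Counter, defaultdict


def analyze(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Hypothetical in-memory inverted index, for illustration only."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of documents containing it
        self.term_counts = {}             # doc id -> Counter of token frequencies

    def add(self, doc_id, text):
        tokens = analyze(text)
        self.term_counts[doc_id] = Counter(tokens)
        for token in tokens:
            self.postings[token].add(doc_id)

    def idf(self, token):
        # Rarer tokens weigh more; a token found in every document weighs 0.
        df = len(self.postings.get(token, ()))
        return math.log(len(self.term_counts) / df) if df else 0.0

    def search(self, query):
        tokens = analyze(query)
        if not tokens:
            return []
        # AND semantics: keep only documents that contain every query token.
        matches = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        scored = [
            (sum(self.term_counts[d][t] * self.idf(t) for t in tokens), d)
            for d in matches
        ]
        # Highest score first; ties fall back to the smaller document id.
        return [d for _, d in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


index = TinyIndex()
index.add(1, "Python search engine in Python")
index.add(2, "A search engine written in Java")
index.add(3, "Python cooking")
print(index.search("python"))
```

Everything lives in ordinary dictionaries and sets, which is why a toy like this is quick on small data. The quoted caveats still apply: there is no field weighting, and a query matches only documents that contain all of its terms.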

Cynthia Murrell, April 9, 2021

Facebook Security: Fodder for Testimony?

April 9, 2021

Who knows if this is true? “533 Million Facebook Users’ Phone Numbers Leaked on Hacker Forum.” The write up states:

The mobile phone numbers and other personal information for approximately 533 million Facebook users worldwide has been leaked on a popular hacker forum for free. The stolen data first surfaced on a hacking community in June 2020 when a member began selling the Facebook data to other members.

If true, the revelation is a nice complement to a series of outstanding achievements by the centralized, big tech, really smart managers at super important companies. Examples include:

  • Twitter’s senior manager spoofing elected officials
  • Microsoft’s Exchange Server misstep when Windows Defender was on the job, sort of
  • Amazon’s brilliant Twitter campaign about workers’ inexplicable need to take breaks
  • Google’s staunch defense of employees who grouse with assurances of continued employment.

Now Mr. Zuckerberg’s digital nation and its outstanding security.

How did this happen? The write up asserts:

According to Alon Gal, CTO of cybercrime intelligence firm Hudson Rock, it is believed that threat actors exploited in 2019 a now-patched vulnerability in Facebook’s “Add Friend” feature that allowed them to gain access to member’s phone numbers.

I envision Mr. Zuckerberg answering this question under oath in an upcoming Congressional hearing:

Senator X: Mr. Zuckerberg, what the heck happened? I have a teenage granddaughter. Are you protecting her?

Mr. Zuckerberg: Senator, thank you for that question. At Facebook, we take every possible precaution to guard our users’ identities. I will look into this matter, provide a report written by an Amazon PR person whom we just hired, and assign the former head of Microsoft security, also a new hire, to investigate. Early reports suggest that the 1,000 criminals attacking Microsoft were supplemented with an additional 2,000 bad actors to breach our highly secure system.

Plus, the loss of data affected a mere 533 million users. Trivial. It is old news too.

Stephen E Arnold, April 9, 2021

Microsoft Adds Semantic Search to Azure Cognitive Search: Is That Fast?

April 9, 2021

Microsoft is adding new capabilities to its cloud-based enterprise search platform Azure Cognitive Search, we learn from “Microsoft Debuts AI-Based Semantic Search on Azure” at Datanami. We’re told the service offers improved development tools. There is also a “semantic caption” function that identifies and displays a document’s most relevant section. Reporter George Leopold writes:

“The new semantic search framework builds on Microsoft’s AI at Scale effort that addresses machine learning models and the infrastructure required to develop new AI applications. Semantic search is among them. The cognitive search engine is based on the BM25 algorithm, (as in ‘best match’), an industry standard for information retrieval via full-text, keyword-based searches. This week, Microsoft released semantic search features in public preview, including semantic ranking. The approach replaces traditional keyword-based retrieval and ranking frameworks with a ranking algorithm using deep neural networks. The algorithm prioritizes search results based on how ‘meaningful’ they are based on query relevance. Semantics-based ranking ‘is applied on top of the results returned by the BM25-based ranker,’ Luis Cabrera-Cordon, group program manager for Azure Cognitive Search, explained in a blog post. The resulting ‘semantic answers’ are generated using an AI model that extracts key passages from the most relevant documents, then ranks them as the sought-after answer to a query. A passage deemed by the model to be the most likely to answer a question is promoted as a semantic answer, according to Cabrera-Cordon.”

By Microsoft’s reckoning, the semantic search feature represents hundreds of development years and millions of dollars in compute time by the Bing search team. We’re told recent developments in transformer-based language models have also played a role, and that this framework is among the first to apply the approach to semantic search. There is one caveat—right now the only language the platform supports is US English. We’re told that others will be added “soon.” Readers who are interested in the public preview of the semantic search engine can register here.
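The two-stage arrangement described in the quote (a BM25 keyword ranker first, a semantic ranker applied on top of its results) can be sketched as follows. This is an illustration under stated assumptions, not Microsoft’s implementation: the function names are hypothetical, the BM25 variant uses the always-positive `log(1 + ...)` form of inverse document frequency, `k1=1.2` and `b=0.75` are commonly cited defaults, and `semantic_score` stands in for whatever neural model does the second-stage scoring.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the query with BM25.

    Assumes `docs` is a non-empty list of non-empty token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each token.
    df = Counter(token for d in docs for token in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for token in query_tokens:
            if token not in tf:
                continue
            idf = math.log(1 + (n_docs - df[token] + 0.5) / (df[token] + 0.5))
            # Length normalization: long documents need more hits to score well.
            norm = tf[token] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[token] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rerank(candidates, semantic_score):
    """Second stage: reorder first-stage candidates by a model-provided score."""
    return sorted(candidates, key=semantic_score, reverse=True)


docs = [
    ["azure", "search", "search"],
    ["cooking", "recipes"],
    ["search", "engine", "basics"],
]
first_stage = bm25_scores(["search"], docs)
# Keep only documents the keyword ranker matched, then hand them to stage two.
candidates = [i for i, s in enumerate(first_stage) if s > 0]
```

The point of the split is cost: the keyword ranker cheaply narrows millions of documents to a short list, and only that short list is passed to the slower model.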

Cynthia Murrell, April 9, 2021

The Alphabet Google YouTube Thing Explains Good Old Outcome Centered Design

April 8, 2021

If you have tried to locate information on a Google Map, you know what good design is, right? What about trying to navigate the YouTube upload interface to add or delete a “channel”? Perfection, okay. What if you have discovered an AMP error email and tried to figure out how a static Web site generated by an AMP approved “partner” can be producing a single flawed Web page? Intuitive and helpful, don’t you think?

Truth is: Google Maps are almost impossible to use regardless of device. The YouTube interface is just weird and better for a 10-year-old video game player than a person over 30, and the AMP messages? Just stupid.

I read “Waymo’s 7 Principles of Outcome-Centered Design Are What Your Product Needs” and thought I stumbled upon a listicle crafted by Stephen Colbert and Jo Koy in the O’Hare Airport’s Jazz Bar.

Waymo (so named because one gets way more with Alphabet Google YouTube technology; hereinafter, AGYT) is managed by co-CEOs. It is semi famous for hiring uber engineer Anthony Levandowski. Plus the company has been beavering away to make driving down 101 semi fun since 2009. The good news is that Waymo seems to be making more headway than the Google team trying to solve death. The Wikipedia entry for Waymo documents 12 collisions, but the exact number of smart errors by the Alphabet Google YouTube software is not known even to some Googlers. Need to know, you know.

What are the rules for outcome centered design; that is, ads but no crashes, I presume? The write up presents seven. Here are three, and you can let your Chrome browser steer you to the full list. Don’t run into the Tesla Web site either, please.

Principle 2. Create focus by clarifying your purpose.

Okay, focus. Let’s see. When riding in a vehicle with no human in charge, the idea is to avoid a crash. What about filtering YouTube for okay content? Well, that only works some of the time. The Waymo crashes appear to underscore the fuzz in the statistical routines.

And Principle 4. Clue in to your customer’s context.

Yep, a vehicle which knows one’s browsing history and has access to nifty profiles with probabilities can just get going. Forget what the humanoid may want. Alphabet Google YouTube is ahead of the humanoid. Sometimes. The AGYT approach is to trim down what the humanoid wants to three options. Close enough for horse shoes. Waymo, like Alphabet Google YouTube, knows best. Just like a digital mistress. The humanoid, however, is going to a previously unvisited location. Another humanoid told the rider face to face about an emergency. The AGYT system cannot figure out context. Not to worry. Those AGYT interfaces will make everything really easy. One can talk to the Waymo equipped smart vehicle. Just speak clearly, slowly, and in a language which Waymo parses in an acceptable manner. Bororo won’t work.

Finally, Principle 7: Edit edit edit.

I think this means revisions. Those are a great idea. Alphabet Google YouTube does an outstanding job with dots, hamburger menus, and breezy writing in low contrast colors. Oh, content? If you don’t get it, you are not Googley. Speak up and you may get the Timnit treatment or the Congressional obfuscation rhetoric. I also like ignoring the antics of senior managers.

Yep, outcome centered. Great stuff. Were Messrs. Colbert and Koy imbibing something other than Sprite at the airport when possibly conjuring this list of really good tips? What’s the outcome? How about ads displayed to passengers in Waymo infused vehicles? Context centered, relevant, and a feature one cannot turn off.

Stephen E Arnold, April 8, 2021

HPE Machine Learning: A Benefit of the Autonomy Tech?

April 8, 2021

This sounds like an optimal solution from HPE (formerly known as HP); too bad it was not available back when the company evaluated the purchase of Autonomy. Network World reports, “HPE Debuts New Opportunity Engine for Fast AI Insights.” The machine-learning platform is called the Software Defined Opportunity Engine, or SDOE. It is based in the cloud, and will greatly reduce the time it takes to create custom sales proposals for HPE channel partners and customers. Citing a blog post from HPE’s Tom Black, writer Andy Patrizio explains:

“It takes a snapshot of the customer’s workloads, configuration, and usage patterns to generate a quote for the best solution for the customer in under a minute. The old method required multiple visits by resellers or HPE itself to take an inventory and gather usage data on the equipment before finally coming back with an offer. That meant weeks. SDOE uses HPE InfoSight, HPE’s database which collects system and use information from HPE’s customer installed base to automatically remediate infrastructure issues. InfoSight is primarily for technical support scenarios. Started in 2010, InfoSight has collected 1,250 trillion data points in a data lake that has been built up from HPE customers. Now HPE is using it to move beyond technical support to rapid sales prep.”

The write-up describes Black’s ah-ha moment when he realized that data could be used for this new purpose. The algorithm-drafted proposals are legally binding—HPE must have a lot of confidence in the system’s accuracy. Besides HPE’s existing database and servers, the process relies on the assessment tool recently acquired when the company snapped up CloudPhysics. We learn that the tool:

“… analyzes on-premises IT environments much in the same way as InfoSight but covers all of the competition as well. It then makes recommendations for cloud migrations, application modernization and infrastructure. The CloudPhysics data lake—which includes more than 200 trillion data samples from more than one million virtual machines—combined with HPE’s InfoSight can provide a fuller picture of their IT infrastructure and not just their HPE gear.”

As of now, SDOE is only for storage systems, but we are told that could change down the road. Black, however, was circumspect on the details.

Cynthia Murrell, April 8, 2021

GitHub: Amusing Security Management

April 8, 2021

I got a kick out of “GitHub Investigating Crypto-Mining Campaign Abusing Its Server Infrastructure.” I am not sure if the write up is spot on, but it is entertaining to think about Microsoft’s security systems struggling to identify an unwanted service running in GitHub. The write up asserts:

Code-hosting service GitHub is actively investigating a series of attacks against its cloud infrastructure that allowed cybercriminals to implant and abuse the company’s servers for illicit crypto-mining operations…

In the wake of the SolarWinds and Exchange Server “missteps,” Microsoft has been making noises about the tough time it has dealing with bad actors. I think one MSFT big dog said there were 1,000 hackers attacking the company.

The main idea is that attackers allegedly mine cryptocurrency on GitHub’s own servers.

This is post SolarWinds and Exchange Server “missteps”, right?

What’s the problem with cyber security systems that monitor real time threats and uncertified processes?

Oh, I forgot. These aggressively marketed cyber systems still don’t work it seems.

Stephen E Arnold, April 8, 2021

Google and the Institutionalization of Me Too, Me Too

April 8, 2021

Never one to let a trend pass it by un-mimicked, Google has created a new YouTube feature. Ars Technica reports, “YouTube’s TikTok Clone, ‘YouTube Shorts,’ Is Live in the US.” The feature actually launched in India last September and has done well there, possibly because TikTok has been banned in that country since June. The feature has now made its way to our shores. Writer Ron Amadeo tells us:

“The YouTube Shorts section shows up on the mobile apps section of the YouTube home screen and for now has a ‘beta’ label. It works exactly like TikTok, launching a full-screen vertical video interface, and users can swipe vertically between videos. As you’d expect, you can like, dislike, comment on, and share a short. You can also tap on a user name from the Shorts interface to see all the shorts from that user. The YouTube twist is that shorts are also regular YouTube videos and show up on traditional channel pages and in subscription feeds, where they are indistinguishable from normal videos. They have the normal YouTube interface instead of the swipey TikTok interface. This appears to be the only way to view these videos on desktop. A big part of TikTok is the video editor, which allows users to make videos with tons of effects, music, filters, and variable playback speeds that contribute to the signature TikTok video style. The YouTube Shorts editor seems nearly featureless in comparison, offering only speed options and some music.”

Absent those signature features, it seems unlikely Shorts will successfully rival TikTok. Perhaps it will last about as long as Stadia, Orkut, or Web Accelerator. At least no one can say Google shies away from trying things that may not work out.

Cynthia Murrell, April 8, 2021

Facebook and Microsoft: Communing with the Spirit of Security

April 7, 2021

Two apparently unrelated actions by bad actors. Two paragons of user security. Two. Count ‘em.

The first incident is summarized in “Huge Facebook Leak That Contains Information about 500 Million People Came from Abuse of Contacts Tool, Company Says.” The main point is that flawed software and bad actors were responsible. But 500 million. Where is Alex Stamos when Facebook needs guru-grade security to zoom into a challenge?

The second incident is explained in “Half a Billion LinkedIn Users Have Scraped Data Sold Online.” Microsoft, the creator of the super useful Defender security system, owns LinkedIn. (How is that migration to Azure coming along?) Microsoft has been a very minor character in the great works of 2021. These are, of course, The Taming of SolarWinds and The Rape of Exchange Server.

Now what’s my point? I think when one adds 500 million and 500 million, the result is a lot of people. Assume 25 percent overlap. Well, that’s still a lot of people’s information which has taken wing.

Indifference? Carelessness? Cluelessness? A lack of governance? I would suggest that a combination of charming personal characteristics makes those responsible individuals one can trust with sensitive information.

Yep, trust and credibility. Important.

Stephen E Arnold, April 7, 2021

Alphabet Google YouTube: We Are Doing Darned Good Work

April 7, 2021

I read a peculiar item of information about the mom-and-pop outfit Alphabet Google YouTube. You may have a different reaction to the allegedly accurate data. Just navigate to “YouTube Claims It’s Getting Better at Enforcing Its Own Moderation Rules.” The “real news” story reports:

In the final months of 2020, up to 18 out of every 10,000 views on YouTube were on videos that violate the company’s policies and should have been removed before anyone watched them. That’s down from 72 out of every 10,000 views in the fourth quarter of 2017, when YouTube started tracking the figure.

Apparently the mom-and-pop outfit calculates a “violative view rate.” This is a metric possible only if a free video service accepts, indexes, and makes available “videos that contain graphic violence, scams, or hate speech.”

About the system, the write up reports:

YouTube’s team uses the figure internally to understand how well they’re doing at keeping users safe from troubling content. If it’s going up, YouTube can try to figure out what types of videos are slipping through and prioritize developing its machine learning to catch them.
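The arithmetic behind the metric is simple enough to check. The sketch below is an assumption about how a rate “per 10,000 views” would be computed from raw counts (the function name is hypothetical, and YouTube’s sampling method is not described in the write up); the two figures come from the quoted passage.

```python
def violative_view_rate(violative_views, total_views):
    """Return the number of violative views per 10,000 total views."""
    return 10_000 * violative_views / total_views


# Figures quoted in the write up, both expressed per 10,000 views.
q4_2017 = 72
q4_2020 = 18  # reported as an upper bound ("up to 18")

# Share of the 2017 rate that has gone away by late 2020.
relative_drop = (q4_2017 - q4_2020) / q4_2017

# The same rates as percentages of all views.
pct_2017 = q4_2017 / 10_000 * 100
pct_2020 = q4_2020 / 10_000 * 100
```

Taken at face value, the quoted numbers amount to a 75 percent drop in the rate. That says nothing about the absolute number of violative views, which also depends on total viewing volume.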

A few questions spring to mind:

  • What specifically is “violative” content? An interview I conducted with a former CIA operative was removed a year after the interview appeared as a segment in my 10 to 15 minute twice monthly video news program. An interview with a retired spy was deemed violative. I hope YouTube learned something from this take down. I remain puzzled.
  • How does content depicting graphic violence, scams, and hate speech get on the YouTube system? After I upload a video, a message appears to tell me if the video is okay or not okay. I think Google’s system is getting better from the mom-and-pop outfit’s point of view. From other points of view? I am not sure.
  • Why trust metrics generated within the Alphabet Google YouTube outfit? By definition, the data collection methods, the sample, and the techniques used to identify what’s important are not revealed. FAANG-type outfits are not exactly the gold standard in ethical behavior for some people. I, of course, believe everything I read online, like transcripts of senior executives’ remarks to Congressional committees.
  • Why release these data now? What’s the point? Apple is tossing cores at Facebook. Alphabet Google YouTube is reminding some that Microsoft’s security is interesting. Amazon wants to pay tax. Maybe these actions and the violative metric are PR.

The write up contains charts. Low contrast colors show just how much better Alphabet Google YouTube is getting in the violative content game. I love the violative view rate phrase. Delicious.

Stephen E Arnold, April 7, 2021

Australia Demands Fairness from Big Tech. Waves Expected Worldwide

April 7, 2021

After wrangling over the issue for weeks, Australian regulators and Facebook have come to an agreement. Regulators demanded the social media platform, as well as Google, start paying news publishers their fair share for content. Sounds reasonable, considering that out of every $100 spent on online advertising in that country, $53 goes to Google and $28 to Facebook. That is 81% going to just two companies.

Facebook responded by temporarily blocking all news to Australian users. (Google made a similar threat, but made deals with several Australian media groups instead.) Now that a compromise has been reached and the blackout ended, all that remains is for the adjusted media law to be passed. Yahoo News discusses “Why the World is Watching Australia’s Tussle with Big Tech.” Writer Andrew Beatty observes:

“Although the rules would only apply in Australia, regulators elsewhere are looking closely at whether the system works and can be applied in other countries. Microsoft — which could gain market share for its Bing search engine — has backed the proposals and explicitly called for other countries to follow Australia’s lead, arguing the tech sector needs to step up to revive independent journalism that ‘goes to the heart of our democratic freedoms’. European legislators have cited the Australian proposals favorably as they draft their own EU-wide digital market legislation. Facebook’s decision to roll back the news ban comes after it received widespread criticism for the initial blackout, which also impacted some emergency response pages used to alert the public to fires, floods and other disasters. The company quickly moved to amend that mistake, but the incident left questions about whether social media platforms should be able to unilaterally remove services that are part of crisis response and may even be considered critical infrastructure.”

Critical infrastructure—that is an interesting twist. Both Facebook and Google insist they don’t mind paying for content, something each has started to do in very limited ways. They just don’t want to be told how much to pay; Australian regulators would like independent arbiters to oversee deals to be sure they are fair. World Wide Web inventor Tim Berners-Lee warns the precedent of charging for links could “break the internet.” Are the extended consequences of holding these two companies to account really so dire?

Cynthia Murrell, April 7, 2021
