Friday, March 3, 2017

Searching Blob Documents with the Azure Search Service

One of the core services in the Microsoft Azure cloud platform is the Storage Service, which includes Blobs, Queues, and Table storage. Blobs are great for anything you would use a file system for, such as avatars, data files, XML/JSON files, ...and documents. But until recently, documents in blob storage had one big shortcoming: they weren't searchable. That is no longer the case. In this post, we'll examine how to search documents in blob storage using the Azure Search Service.

Azure Blob Basics

Let's quickly cover the basics of Azure blobs: 

  • Storage Accounts. To work with storage, you need to allocate a storage account in the Azure Management Portal. A storage account can be used for any or all of the following: blob storage, queue storage, table storage. We're focusing on blob storage in this article. To access a storage account you need its name and a key.
  • Containers. In your storage account, you can create one or more named containers. A container is kind of like a file folder--but without subfolders. Fortunately, there's a way to mimic subfolders (you may include slashes in blob names). Containers can be publicly accessible over the Internet, or have restricted access that requires an access key.
  • Blobs. A blob is a piece of data with a name, content, and some properties. For all intents and purposes, you can think of a blob as a file. There are actually several kinds of blobs (block blobs, append blobs, and page blobs). For our purposes here, whenever we mention blob we mean block blob, which is the type that most resembles a sequential file.

Uploading Documents to Azure Blob Storage

Let's say you're writing a book in Microsoft Word and have saved the chapters as pdf files--ch01.pdf, ch02.pdf, ... up to ch10.pdf, along with toc.pdf and preface.pdf--which you would like to store in blob storage and be able to search. Here's an example of what a page of this chapter content looks like:

In your Azure storage account you can create a container (folder) for documents. In my case, I created a container named book-docs to hold the book chapter documents. If you upload the 12 pdf documents described above to that container, you'll end up with 12 blobs (files). 

Structure of Azure Storage showing a Container and Blobs

To upload documents and work with your storage account, you'll need a storage explorer tool. You can use either my original Azure Storage Explorer or Microsoft's Azure Storage Explorer. We'll use Microsoft's explorer in this article because it has better support for one of the features we need: custom metadata properties. After downloading and launching the Storage Explorer and configuring it with our storage account, here is what it looks like once the container has been created and the 12 blobs uploaded.

12 pdf documents uploaded as blobs
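The Storage Explorer handles uploads through its UI, but you can also upload the documents in code. Here's a minimal sketch using the Windows Azure Storage Library; the connection string, account key, and local folder below are placeholders for your own values:

// Connect to the storage account (placeholder connection string).
CloudStorageAccount account = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=<your-key>");
CloudBlobClient blobClient = account.CreateCloudBlobClient();

// Create the book-docs container if it doesn't already exist.
CloudBlobContainer container = blobClient.GetContainerReference("book-docs");
container.CreateIfNotExists();

// Upload each local pdf as a block blob named after the file.
foreach (String file in System.IO.Directory.GetFiles(@"C:\book", "*.pdf"))
{
    CloudBlockBlob blob = container.GetBlockBlobReference(System.IO.Path.GetFileName(file));
    using (var stream = System.IO.File.OpenRead(file))
    {
        blob.UploadFromStream(stream);
    }
}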


Setting Document Properties

It would be nice to search these documents not only based on content, but also based on metadata. We can add metadata properties (name-value pairs) to each of these blobs. In the Microsoft Azure Storage Explorer, right-click a blob and select Properties. In the Properties dialog, click Add Metadata to add a property and enter a name and value. We'll later be able to search these properties. In my example, we've added a property named DocType and a property named Title to each document, with values like "pdf" and "Chapter 1: Cloud Computing Explained".

Blob with several metadata properties
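You can also set these metadata properties in code rather than through the Storage Explorer dialog. A quick sketch, reusing the blobClient from the upload sketch above:

// Add DocType and Title metadata to a blob, equivalent to using
// Add Metadata in the Storage Explorer Properties dialog.
CloudBlobContainer container = blobClient.GetContainerReference("book-docs");
CloudBlockBlob blob = container.GetBlockBlobReference("ch01.pdf");

blob.FetchAttributes();    // load the blob's current properties and metadata
blob.Metadata["DocType"] = "pdf";
blob.Metadata["Title"] = "Chapter 1: Cloud Computing Explained";
blob.SetMetadata();        // persist the metadata back to blob storage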

Azure Search Basics

The Azure Search Service is able to search a variety of cloud data sources that include SQL Databases, DocumentDB, Table Storage, and Blob Storage (which is what we're interested in here). Azure Search is powered by Lucene, an open-source indexing and search technology. 

Azure Search can index both the content of blob documents and metadata properties of blobs. However, content is only indexable/searchable for supported file types: pdf, Microsoft Office (doc/docx, xls/xlsx, ppt/pptx), msg (Outlook), html, xml, zip, eml, txt, json, and csv. 

To utilize Azure Search, it will be necessary to create three entities: a Data Source, an Index, and an Indexer (don't confuse these last two). These three entities work together to make searches possible.

  • Data Source: defines the data source to be accessed. In our case, a blob container in an Azure storage account.
  • Index: the structure of an index that will be filled by scanning the data source, and queried in order to perform searches.
  • Indexer: a definition for an indexing agent, configured with a data source to scan and an index to populate.

These entities can be created and managed in the Azure Management Portal, or in code using the Azure Search REST API, or in code using the Azure Search .NET API. We'll be showing how to do it in C# code with the .NET API.


Installing the Azure Search API Package

Our code requires the Azure Search package, which is added using NuGet. In Visual Studio, right-click your project and select Manage NuGet Packages, then find and install the Microsoft Azure Search Library. You'll also need the Windows Azure Storage Library, also installed with NuGet.
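If you prefer the Package Manager Console, the equivalent install commands are along these lines (package names as of this writing):

Install-Package Microsoft.Azure.Search
Install-Package WindowsAzure.Storage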

At the top of our code, we'll need using statements for a number of Microsoft.Azure.Search and Microsoft.WindowsAzure namespaces, and some related .NET namespaces:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;
using Microsoft.Azure.Search.Serialization;
using Newtonsoft.Json;
using System.ComponentModel.DataAnnotations;
using System.Web;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

Creating a Search Service

The first step in working with Azure Search is to create a search service using the Azure Management Portal. There are several service tiers you can choose from, including a free service tier which will let you play around with 3 data sources and indexes. To work with your search service in code, you'll need its name and an API admin key, both of which you can get from the management portal. We'll be showing a fake name and key in this article, which you should replace with your actual search service name and key.

Creating a Service Client

To interact with Azure Search in our code, we need to first instantiate a service client, specifying the name and key for our search service:

string searchServiceName = "mysearchservice";
string searchServiceKey = "A65C5028BD889FA0DD2E29D0A8122F46";

SearchServiceClient serviceClient = new SearchServiceClient(searchServiceName, new SearchCredentials(searchServiceKey));

Creating a Data Source

To create a data source, we use the service client to add a new DataSource object to its DataSources collection. You'll need your storage account name and key (note this is a different credential from the search service name and key in the previous section). The following parameters are defined in the code below:

  • Name: name for the data source.
  • Type: the type of data source (AzureBlob).
  • Credentials: storage account connection string.
  • Container: identifies which container in blob storage to access.
  • DataDeletionDetectionPolicy: defines a deletion policy (soft delete), and identifies a property (Deleted) and value (1) which will be recognized as a deletion. Blobs with property Deleted:1 will be removed from the index. We'll explain more about this later.

String datasourceName = "book-docs";
if (!serviceClient.DataSources.Exists(datasourceName))
{
    serviceClient.DataSources.Create(new DataSource()
    {
        Name = datasourceName,
        Type = Microsoft.Azure.Search.Models.DataSourceType.AzureBlob,
        Credentials = new DataSourceCredentials("DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=GL3AAN0Xyy/8nvgBJcVr9lIMgCTtBeIcKuL46o/TTCpEGrReILC5z9k4m4Z/yZyYNfOeEYHEHdqxuQZmPsjoeQ=="),
        Container = new Microsoft.Azure.Search.Models.DataContainer(datasourceName),
        DataDeletionDetectionPolicy = new Microsoft.Azure.Search.Models.SoftDeleteColumnDeletionDetectionPolicy() { SoftDeleteColumnName = "Deleted", SoftDeleteMarkerValue = "1" }
    });
}

With our data source defined, we can move on to creating our index and indexer.

Creating an Index

Next, we need to create the index that Azure Search will maintain for searches. The code below creates an index named book. It populates the Fields collection with the fields we are interested in tracking for searches. This includes:

  • content: the blob's content.
  • native metadata fields that come from accessing blob storage (such as metadata_storage_name, metadata_storage_path, metadata_storage_last_modified,  ...). 
  • custom metadata properties we've decided to add: DocType, Title, and Deleted.

Once the object is set up, it is added to the service client's Indexes collection, which creates the index.

String indexName = "book";
Index index = new Index()
{
    Name = indexName,
    Fields = new List<Field>()
};

index.Fields.Add(new Field() { Name = "content", Type = Microsoft.Azure.Search.Models.DataType.String, IsSearchable = true });
index.Fields.Add(new Field() { Name = "metadata_storage_content_type", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "metadata_storage_size", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "metadata_storage_last_modified", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "metadata_storage_content_md5", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "metadata_storage_name", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "metadata_storage_path", Type = Microsoft.Azure.Search.Models.DataType.String, IsKey = true, IsRetrievable = true , IsSearchable = true});
index.Fields.Add(new Field() { Name = "metadata_author", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "metadata_language", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "metadata_title", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "DocType", Type = Microsoft.Azure.Search.Models.DataType.String, IsSearchable = true });
index.Fields.Add(new Field() { Name = "Title", Type = Microsoft.Azure.Search.Models.DataType.String, IsSearchable = true });

if (serviceClient.Indexes.Exists(indexName))
{
    serviceClient.Indexes.Delete(indexName);
}
serviceClient.Indexes.Create(index);

Let's take note of some things about the index we're creating:
  • Some of the fields are built-in from what Azure Search intrinsically knows about blobs. This includes content and all the properties beginning with "metadata_". Especially take note of metadata_storage_path, which is the full URL of the blob. This is marked as the key of the index. This will ensure we do not receive duplicate documents in our search results.
  • Some of the fields are custom properties we've chosen to add to our blobs. This includes DocType and Title.

Creating an Indexer

And now we can create an indexer (not to be confused with an index). The indexer is the entity that will regularly scan the data source and keep the index up to date. The Indexer object identifies the data source to be scanned and the index to be updated, and it also contains a schedule; in this case, the indexer will run every 30 minutes. Once the indexer object is set up, it is added to the service client's Indexers collection, which creates the indexer. In the background, the indexer will start running to scan the data source and populate the index. Its progress can be monitored using the Azure Management Portal.

String indexName = "book";
String indexerName = "book-docs";
Indexer indexer = new Indexer()
{
    Name = indexerName,
    DataSourceName = indexerName,      // the data source created earlier is also named "book-docs"
    TargetIndexName = indexName,
    Schedule = new IndexingSchedule()
    {
        Interval = System.TimeSpan.FromMinutes(30)
    }
};
indexer.FieldMappings = new List<Microsoft.Azure.Search.Models.FieldMapping>();
indexer.FieldMappings.Add(new Microsoft.Azure.Search.Models.FieldMapping()
{
    SourceFieldName = "metadata_storage_path",
    MappingFunction = Microsoft.Azure.Search.Models.FieldMappingFunction.Base64Encode()
});

if (serviceClient.Indexers.Exists(indexerName))
{
    serviceClient.Indexers.Delete(indexerName);
}
serviceClient.Indexers.Create(indexer);

Let's point out some things about the indexer we're creating:

  • The indexer has a schedule, which determines how often it scans blob storage to update the index. The code above sets a schedule of every 30 minutes.
  • There is a field mapping function defined for the metadata_storage_path field, which is the document path and our unique key. Why do we need this? Well, it's possible this path value might contain characters that are invalid for an index column; to avoid failures, it is necessary to Base64-encode the value. We'll have to decode this value whenever we retrieve search results.

Putting this all together, when we run the sample included with this article it takes around half a minute to create the data source, index, and indexer. The index is initially empty, but the indexer is already running in the background and will be ready for searching in about a minute.


Creating data source, index, and indexer

Searching Blob Documents

With all of this setup out of the way, we're finally ready to perform searches.

BlobDocument Class

As we perform searches, we're going to need a class to represent a blob document. This class needs to be aligned with how we defined our index. Our sample uses the BlobDocument class below.

[SerializePropertyNamesAsCamelCase]
public class BlobDocument
{
    [IsSearchable]
    public String content { get; set; }

    [IsSearchable]
    public String metadata_storage_name { get; set; }

    [Key]
    [IsSearchable]
    public String metadata_storage_path { get; set; }

    public String metadata_storage_last_modified { get; set; }

    public String metadata_storage_content_md5 { get; set; }

    public String metadata_author { get; set; }

    public String metadata_content_type { get; set; }

    public String metadata_language { get; set; }

    public String metadata_title { get; set; }

    [IsSearchable]
    public String DocType { get; set; }

    public String Deleted { get; set; } // A value of 1 is a soft delete

    [IsSearchable]
    public String Title { get; set; }
}

Simple Searching

A simple search just specifies some search text, such as "cloud". 

Up until now we've been using a Service Client to set up search entities. To perform searches, we'll instead use an Index Client, which is created this way:

String indexName = "book";
ISearchIndexClient indexClient = serviceClient.Indexes.GetClient(indexName);

To perform a search, we first define what it is we want to return in our results. We'd like to know the blob document URL, its content, as well as the two custom metadata properties we defined for our blob documents, DocType and Title.

SearchParameters parameters = new SearchParameters()
{
    Select = new[] { "content", "DocType", "Title", "metadata_storage_path" }
};

We call the index client's Documents.Search method to perform a search for "cloud" and return results.

String searchText = "cloud";
DocumentSearchResult<BlobDocument> searchResults = indexClient.Documents.Search<BlobDocument>(searchText, parameters);

Search Results

The result of our search is a DocumentSearchResult<BlobDocument> object. We can iterate through the results using a loop. When we defined our index earlier, we had to give Azure Search instructions to Base64-encode the metadata storage path field when necessary. As a result, we now need to decode the path.

foreach (SearchResult<BlobDocument> result in searchResults.Results)
{
    Console.WriteLine("---- result ----");
  
    String path = result.Document.metadata_storage_path;
    if (!String.IsNullOrEmpty(path) && !path.Contains("/"))
    {
        path = Base64IUrlDecode(result.Document.metadata_storage_path);
    }
    Console.WriteLine("metadata_storage_path: " + path);
    Console.WriteLine("DocType: " + result.Document.DocType);
    Console.WriteLine("Title: " + result.Document.Title);
}
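The Base64IUrlDecode helper called above isn't shown in this listing; here's a minimal sketch of what such a helper might look like, assuming the field mapping applied the default URL-token flavor of Base64 encoding:

// Sketch: reverse the URL-safe (URL token) Base64 encoding that the
// Base64Encode field mapping applied to metadata_storage_path.
private static String Base64IUrlDecode(String value)
{
    byte[] bytes = System.Web.HttpServerUtility.UrlTokenDecode(value);
    return (bytes == null) ? null : System.Text.Encoding.UTF8.GetString(bytes);
}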

The path is the part of the result that matters most, because if a user is interested in a particular search result it lets them download or view the document itself. Note this is only true if your blob container is configured to permit public access.

Now that we have enough code to perform a search and view the results, let's try some simple searches. For starters, we search on "pdf". That is not a term that appears in the content of any of the documents, but it is a value in the metadata: specifically, the DocType property we added to each blob earlier. As a result, all 12 documents match:

Search for "pdf" - 12 matches to document metadata

Now, let's try a search term that should match some of the content within these documents. A search for "safe" matches 3 documents:

Search for "safe" - 3 matches to document content


More Complex Searches

We can use some special syntax in our search query to do more complex queries.

To perform an AND between two search terms, use the + operator. The query storage+security will only match documents that contain both "storage" and "security".


AND query

To perform an OR between two search terms, use the | operator. The query dangerous|roi will match documents containing "dangerous" or "ROI".


OR query
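In code, these operators are simply part of the search text passed to the same Documents.Search call we used earlier:

// AND: only documents containing both "storage" and "security"
DocumentSearchResult<BlobDocument> andResults =
    indexClient.Documents.Search<BlobDocument>("storage+security", parameters);

// OR: documents containing either "dangerous" or "roi"
DocumentSearchResult<BlobDocument> orResults =
    indexClient.Documents.Search<BlobDocument>("dangerous|roi", parameters);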

In a future post, we'll explore how to perform advanced searches. 

Deleting Documents

Normally, deleting a blob involves nothing more than selecting it in a storage explorer and clicking Delete (or doing the equivalent in code). However, with an Azure Search index it is a little more complicated: if you summarily delete a blob that was previously indexed, it will remain in the index because the indexer will not realize the blob is gone. This can lead to search results being returned for documents that no longer exist.

We can get around this unpleasantness by utilizing a soft delete strategy. We will define a property that means "deleted", which we will tell Azure Search about. In our case, we'll call our property Deleted. A "soft delete" will cause the blob to be removed from the index when Deleted:1 is encountered by the indexer--after which it is safe to actually delete the blob. You might consider having an overnight activity scheduled that deletes all blobs marked as deleted.
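In code, a soft delete is just another metadata update, followed (once the indexer has run) by the real delete. A quick sketch, assuming the container reference from the upload sketch earlier:

// Soft delete: mark the blob as deleted so the indexer drops it from the index.
CloudBlockBlob blob = container.GetBlockBlobReference("ch07.pdf");
blob.FetchAttributes();
blob.Metadata["Deleted"] = "1";    // matches the data source's SoftDeleteColumnName / SoftDeleteMarkerValue
blob.SetMetadata();

// Later, after the indexer has processed the soft delete, the blob itself can be removed:
// blob.DeleteIfExists();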

Summary

With Azure Search, documents in Azure Blob Storage are finally searchable. Using Azure Search gives you the power of Lucene without requiring you to set up and maintain it yourself, and it has the built-in capability to work with blob storage. Although there are a few areas of Azure Search that are cumbersome (notably, Base64 encoding of document URLs and handling of deleted documents), for the most part it is a joy to use: it's fast and powerful. 

You can download the full sample (VS2013 solution) here.

2 comments:

Jean Crabtree said...

Hey David, is there a suggestion list for Storage Explorer features? I'd very much like to have Cmd+A work in the Query box on the OSX version.

David Pallmann said...

Jean, sorry, but Azure Storage Explorer isn't under active development right now - I just don't have the time to work on it given my current workload. Your best bet is to check out Microsoft's own Azure Storage Explorer at http://storageexplorer.com/