Tag: Technology Stories

  • Redefining Product Attribute Tagging With AI-Powered Retail Domain Language Models

    Redefining Product Attribute Tagging With AI-Powered Retail Domain Language Models

    In online retail, success hinges on more than just offering quality products at competitive prices. As eCommerce catalogs expand and consumer expectations soar, businesses face an increasingly complex challenge: How do you effectively organize, categorize, and present your vast product assortments in a way that enhances discoverability and drives sales?

    Having complete and correct product catalog data is key. Effective product attribute tagging—a crucial yet frequently undervalued capability—helps achieve this accuracy and completeness. While traditional methods of tagging product attributes have long struggled with scalability, consistency, accuracy, and speed, fundamentally new ways of addressing these challenges are becoming established. These follow from the revolution brought about by Large Language Models, but take the form of Small Language Models (SLMs) or, more precisely, Domain-Specific Language Models. They can be considered foundational models in that they solve a wide variety of downstream tasks, albeit within specific domains, and they are far more efficient and perform much better at those tasks than a general-purpose LLM.

    Retail Domain Language Models (RLMs) have the potential to transform the eCommerce customer journey. As always, it’s never a binary choice. In fact, LLMs can be a great starting point since they provide an enhanced semantic understanding of the world at large: they can be used to mine structured information (e.g., product attributes and values) out of unstructured data (e.g., product descriptions), create baseline domain knowledge (e.g., manufacturer-brand mappings), augment information (e.g., image to prompt), and create first-cut training datasets.

    Powered by cutting-edge Generative AI and RLMs, next-generation attribute tagging solutions are transforming how online retailers manage their product catalog data, optimize their assortment, and deliver superior shopping experiences. As a new paradigm in search emerges – based more on intent and outcome, powered by natural language queries and GenAI based Search Agents – the capability to create complete catalog information and rich semantics becomes increasingly critical.

    In this post, we’ll explore the crucial role of attribute tagging in eCommerce, delve into the limitations of conventional tagging methods, and unveil how DataWeave’s innovative AI-driven approach is helping businesses stay ahead in the competitive digital marketplace.

    Why Product Attribute Tagging is Important in eCommerce

    As the eCommerce landscape continues to evolve, the importance of attribute tagging will only grow, making it a pertinent focus for forward-thinking online retailers. By investing in robust attribute tagging systems, businesses can gain a competitive edge through improved product comparisons, more accurate matching, understanding intent, and enhanced customer search experiences.

    Taxonomy Comparison and Assortment Gap Analysis

    Products are categorized and organized differently on different retail websites. Comparing taxonomies helps in understanding focus categories and potential gaps in assortment breadth in relation to one’s competitors: missing product categories, sizes, variants or brands. It also gives insights into the navigation patterns and information architecture of one’s competitors. This can help in making search and navigation experience more efficient by fine tuning product descriptions to include more attributes and/or adding additional relevant filters to category listing pages.

    For instance, check out the different Backpack categories on Amazon and Staples in the images below.

    Product Names and Category Names Differ on Different eCommerce Platforms - Here's an Amazon Example
    Product Names and Category Names Differ on Different eCommerce Platforms - Here's a Staples Example

    Or look at the nomenclature of categories for “Pens” on Amazon (left side of the image) and Staples (right side of the image) in the image below.

    Product Names and Category Names Differ on Different eCommerce Platforms -Here's how Staples Vs. Amazon Categories look for Pens

    Assortment Depth Analysis

    Another big challenge in eCommerce is the lack of standardization in retailer taxonomy. This inconsistency makes it difficult to compare the depth of product assortments across different platforms effectively. For instance, to categorize smartphones,

    • Retailer A might organize it under “Electronics > Mobile Phones > Smartphones”
    • Retailer B could use “Technology > Phones & Accessories > Cell Phones”
    • Retailer C might opt for “Consumer Electronics > Smartphones & Tablets”

    Inconsistent nomenclature and grouping create a significant hurdle for businesses trying to gain a competitive edge through assortment analysis. The challenge is exacerbated when the analysis needs to drill down into one or more product attributes. For instance, look at the image below to get an idea of the many attribute variations for “Desks” on Amazon and Staples.

    With Multiple Attributes Named in a Variety of Ways, Attribute Tagging is Essential to Ensure Accurate Product Matching

    Custom categorization through attribute tagging is essential for conducting granular assortment comparisons, allowing companies to accurately assess their product offerings against those of competitors.

    Enhancing Product Matching Capabilities

    Accurate product matching across different websites is fundamental for competitive pricing intelligence, especially when matching similar and substitute products. Attribute tagging and extraction play a crucial role in this process by narrowing down potential matches more effectively, enabling matching for both exact and similar products, and tagging attributes such as brand, model, color, size, and technical specifications.

    For instance, when choosing to match similar products in the Sofa category for 2-3 seater sofas from Wayfair and Overstock, tagging attributes like brand, color, size, and more is a must for accurate comparisons.

    Attribute Tagging for Home & Furniture Categories Like Sofas Helps Improve Matching Accuracy

    Taking a granular approach not only improves pricing strategies but also helps identify gaps in product offerings and opportunities for expansion.

    Fix Content Gaps and Improve Product Detail Page (PDP) Content

    Attribute tagging plays a vital role in enhancing PDP content by ensuring adherence to brand integrity standards and content compliance guidelines across retail platforms. Tagging attributes allows for benchmarking against competitor content, identifying catalog gaps, and enriching listings with precise details.

    This strategic tagging process can highlight missing or incomplete information, enabling targeted optimizations or even complete rewrites of PDP content to improve discoverability and drive conversions. With accurate attribute tagging, businesses can ensure each product page is fully optimized to capture consumer attention and meet retail standards.

    Elevating the Search Experience

    In today’s online retail marketplace, a superior search experience can be the difference between a sale and a lost customer. Through in-depth attribute tagging, vendors can enable more accurate filtering to improve search result relevance and facilitate easier product discovery for consumers.

    By integrating rich product attributes extracted by AI into an in-house search platform, retailers can empower customers with refined and user-friendly search functionality. Enhanced search capabilities not only boost customer satisfaction but also increase the likelihood of conversions by helping shoppers find exactly what they’re looking for more quickly and with minimal effort.

    Pitfalls of Conventional Product Tagging Methods

    Traditional methods of attribute tagging, such as manual and rule-based systems, have more recently been augmented by machine learning. While these approaches may have sufficed in the past, they are increasingly proving inadequate in the face of today’s dynamic and expansive online marketplaces.

    Scalability

    As eCommerce catalogs expand to include thousands or even millions of products, the limitations of machine learning and rule-based tagging become glaringly apparent. As new product categories emerge, these systems struggle to keep pace, often requiring extensive revisions to existing tagging structures.

    Inconsistencies and Errors

    Not only is reliance on an entirely human-driven tagging process expensive, but it also introduces a significant margin for error. While machine learning can automate the tagging process, it’s not without its limitations. Errors can occur, particularly when dealing with large and diverse product catalogs.

    As inventories grow more complex to handle diverse product ranges, the likelihood of conflicting or erroneous rules increases. These inconsistencies can result in poor search functionality, inaccurate product matching, and ultimately a frustrating experience for customers, eroding the benefits of tagging in the first place.

    Speed

    When product information changes or new attributes need to be added, manually updating tags across a large catalog is a time-consuming process. Slow tagging processes make it difficult for businesses to quickly adapt to emerging market trends, causing significant delays in listing new products and potentially missing crucial market opportunities.

    How DataWeave’s Advanced AI Capabilities Revolutionize Product Tagging

    Advanced solutions leveraging RLMs and Generative AI offer promising alternatives capable of overcoming these challenges and unlocking new levels of efficiency and accuracy in product tagging.

    DataWeave automates product tagging to address many of the pitfalls of other conventional methods. We offer a powerful suite of capabilities that empower businesses to take their product tagging to new heights of accuracy and scalability with our unparalleled expertise.

    Our sophisticated AI system brings an advanced level of intelligence to the tagging process.

    RLMs for Enhanced Semantic Understanding

    Semantic Understanding of Product Descriptions

    RLMs analyze the meaning and context of product descriptions rather than relying on keyword matching.
    Example: “Smartphone with a 6.5-inch display” and “Phone with a 6.5-inch screen” are semantically similar, though phrased differently.
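
    To make this concrete, here is a minimal sketch of how such semantic similarity could be scored with off-the-shelf sentence embeddings. The sentence-transformers library and the all-MiniLM-L6-v2 checkpoint are illustrative stand-ins for a retail-domain model, not the system described here.

    ```python
    # Illustrative: score semantic similarity between two product descriptions
    # using generic sentence embeddings (a stand-in for a retail-domain model).
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    a = "Smartphone with a 6.5-inch display"
    b = "Phone with a 6.5-inch screen"

    emb_a, emb_b = model.encode([a, b], convert_to_tensor=True)
    print(util.cos_sim(emb_a, emb_b).item())  # close to 1.0 despite different wording
    ```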

    Attribute Extraction

    RLMs can identify important product attributes (e.g., brand, size, color, model) even from noisy or unstructured data.
    Example: Extracting “Apple” as a brand, “128GB” as storage, and “Pink” as the color from a mixed description.
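
    As a sketch of what this looks like in practice, the snippet below prompts a language model to return attributes as JSON. Here, rlm_generate is a hypothetical text-generation callable, and the prompt and keys are illustrative assumptions, not the production setup.

    ```python
    # Illustrative: prompt a (hypothetical) retail language model to pull structured
    # attributes out of a noisy product description. `rlm_generate` is a placeholder
    # for whatever text-generation endpoint is actually used.
    import json

    PROMPT = """Extract the brand, storage and color from the product text below.
    Return JSON with keys: brand, storage, color.

    Product text: {text}
    JSON:"""

    def extract_attributes(text: str, rlm_generate) -> dict:
        raw = rlm_generate(PROMPT.format(text=text))
        return json.loads(raw)

    # For "Apple iPhone 13 128GB Pink (Renewed)" we would expect something like
    # {"brand": "Apple", "storage": "128GB", "color": "Pink"}
    ```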

    Identifying Implicit Relationships

    RLMs find implicit relationships between products that traditional rule-based systems miss.
    Example: Recognizing that “iPhone 12 Pro” and “Apple iPhone 12” are part of the same product family.

    Synonym Recognition in Product Descriptions

    Synonym Matching with Context

    RLMs identify when different words or phrases describe the same product.
    Examples: “Sneakers” = “Running Shoes”, “Memory” = “RAM” (in electronics)
    Even subtle differences in wording, like “rose gold” vs. “pink”, are interpreted correctly.

    Overcoming Brand-Specific Terminology

    Some brands use their own terminologies (e.g., “Retina Display” for Apple).
    RLMs can map proprietary terms to more generic ones (e.g., Retina Display = High-Resolution Display).

    Dealing with Ambiguities

    RLMs analyze surrounding text to resolve ambiguities in product descriptions.
    Example: Resolving “charger” to mean a “phone charger” when matched with mobile phones.

    Contextual Understanding for Improved Accuracy and Precision

    By leveraging advanced natural language processing (NLP), DataWeave’s AI can process and understand the context of lengthy product descriptions and customer reviews, minimizing errors that often arise at human touch points. The solution interprets this content to extract key information, dramatically improving the overall accuracy of product tags.

    It excels at grasping the subtle differences between similar products, such as sizes and colors, and at identifying and tagging minute differences between items, ensuring that each product is uniquely and accurately represented in a retailer’s catalog.

    This has a major impact on product and similarity-based matching that can even help optimize similar and substitute product matching to enhance consumer search. At the same time, our AI can understand that the same term might have different meanings in various product categories, adapting its tagging approach based on the specific context of each item.

    This deep comprehension ensures that even nuanced product attributes are accurately captured and tagged for easy discoverability by consumers.

    Case Study: Niche Jewelry Attributes

    DataWeave’s advanced AI can assist in labeling the subtle attributes of jewelry by analyzing product images and generating prompts to describe the image. In this example, our AI identifies the unique shapes and materials of each item in the prompts.

    The RLM can then extract key attributes from the prompt to generate tags. This assists in accurate product matching for searches as well as enhanced product recommendations based on similarities.

    DataWeave's AI assists in extracting contextual attributes for accuracy in product matching

    This multi-model approach provides the flexibility to adapt as product catalogs expand while remaining consistent with tagging to yield more robust results for consumers.

    Unparalleled Scalability

    DataWeave can rapidly scale tagging for new categories. The solution is built to handle the demands of even the largest eCommerce catalogs, enabling:

    • Effortless management of extensive product catalogs: We can process and tag millions of products without compromising on speed or accuracy, allowing businesses to scale without limitations.
    • Automated bulk tagging: New product lines or entire categories can be tagged automatically, significantly reducing the time and resources required for catalog expansion.

    Normalizing Size and Color in Fashion

    Style, color, and size are the core attributes in the fashion and apparel categories. Style attributes, which include design, appearance, and overall aesthetics, can be highly specific to individual product categories.

    Normalizing Size and Color in Fashion for Product Matching

    Our product matching engine can easily handle color and sizing complexity via our AI-driven approach combined with human verification. By leveraging advanced technology to identify and normalize identical and similar products from competitors, you can optimize your pricing strategy and product assortment to remain competitive. Using Generative AI in normalizing color and size in fashion is key to powering competitive pricing intelligence at DataWeave.

    Continuous Adaptation and Learning

    Our solution evolves with your business, improving continuously through feedback and customization for retailers’ specific product categories. The system can be fine-tuned to understand and apply specialized tagging for niche or industry-specific product categories. This ensures that tags remain relevant and accurate across diverse catalogs and as trends emerge.

    The AI in our platform also continuously learns from user interactions and feedback, refining its tagging algorithms to improve accuracy over time.

    Stay Ahead of the Competition With Accurate Attribute Tagging

    In the current landscape, the ability to accurately and consistently tag product attributes is no longer a luxury—it’s essential for staying competitive. With advancements in Generative AI, companies like DataWeave are revolutionizing the way product tagging is handled, ensuring that every item in a retailer’s catalog is presented with precision and depth. As shoppers demand a more intuitive, seamless experience, next-generation tagging solutions are empowering businesses to meet these expectations head-on.

    DataWeave’s innovative approach to attribute tagging is more than just a technical improvement; it’s a strategic advantage in an increasingly competitive market. By leveraging AI to scale and automate tagging processes, online retailers can keep pace with expansive product assortments, manage content more effectively, and adapt swiftly to changes in consumer behavior. In doing so, they can maintain a competitive edge.

    To learn more, talk to us today!

  • Using Siamese Networks to Power Accurate Product Matching in eCommerce

    Using Siamese Networks to Power Accurate Product Matching in eCommerce

    Retailers often compete on price to gain market share in high performance product categories. Brands too must ensure that their in-demand assortment is competitively priced across retailers. Commerce and digital shelf analytics solutions offer competitive pricing insights at both granular and SKU levels. Central to this intelligence gathering is a vital process: product matching.

    Product matching or product mapping involves associating identical or similar products across diverse online platforms or marketplaces. The matching process leverages the capabilities of Artificial Intelligence (AI) to automatically create connections between various representations of identical or similar products. AI models create groups or clusters of products that are exactly the same or “similar” (based on some objectively defined similarity criteria) to solve different use cases for retailers and consumer brands.

    Accurate product matching offers several key benefits for brands and retailers:

    • Competitive Pricing: By identifying identical products across platforms, businesses can compare prices and adjust their strategies to remain competitive.
    • Market Intelligence: Product matching enables brands to track their products’ performance across various retailers, providing valuable insights into market trends and consumer preferences.
    • Assortment Planning: Retailers can analyze their product range against competitors, identifying gaps or opportunities in their offerings.

    Why Product Matching is Incredibly Hard

    But product matching stands out as one of the most demanding technical processes for commerce intelligence tools. Here’s why:

    Data Complexity

    Product information comes in various (multimodal) formats – text, images, and sometimes video. Each format presents its own set of challenges, from inconsistent naming conventions to varying image quality.

    Data Variance

    The considerable fluctuations in both data quality and quantity across diverse product categories, geographical regions, and websites introduce an additional layer of complexity to the product matching process.

    Industry Specific Nuances

    Industry specific nuances introduce unique challenges to product matching. Exact matching may make sense in certain verticals, such as matching part numbers in industrial equipment or identifying substitute products in pharmaceuticals. But for other industries, exactly matched products may not offer accurate comparisons.

    • In the Fashion and Apparel industry, style-to-style matching, accommodating variants and distinguishing between core sizes and non-core sizes and age groups become essential for accurate results.
    • In Home Improvement, the presence of unbranded products, private labels, and the preference for matching sets rather than individual items complicates the process.
    • On the other hand, for grocery, product matching becomes intricate due to the distinction between item pricing and unit pricing. Managing the diverse landscape of different pack sizes, quantities, and packaging adds further layers of complexity.

    Diverse Downstream Use Cases

    The diverse downstream business applications give rise to various flavors of product matching tailored to meet specific needs and objectives.

    In essence, while product matching is a critical component in eCommerce, its intricacies demand sophisticated solutions that address the above challenges.

    To solve these challenges, at DataWeave, we’ve developed an advanced product matching system using Siamese Networks, a type of machine learning model particularly suited for comparison tasks.

    Siamese Networks for Product Matching

    Our methodology involves the use of ensemble deep learning architectures. In such cases, multiple AI models are trained and used simultaneously to ensure highly accurate matches. These models tackle NLP (natural language processing) and Computer Vision challenges specific to eCommerce. This technology helps us efficiently narrow down millions of product candidates to just 5-15 highly relevant matches.

    The Tech Powering Siamese Networks

    The key to our approach is creating what we call “embeddings” – think of these as unique digital fingerprints for each product. These embeddings are designed to capture the essence of a product in a way that makes similar products easy to identify, even when they look slightly different or have different names.

    Our system learns to create these embeddings by looking at millions of product pairs. It learns to make the embeddings for similar products very close to each other while keeping the embeddings for different products far apart. This process, known as metric learning, allows our system to recognize product similarities without needing to put every product into a rigid category.

    This approach is particularly powerful for eCommerce, where we often need to match products across different websites that might use different names or images for the same item. By focusing on the key features that make each product unique, our system can accurately match products even in challenging situations.

    How Do Siamese Networks Work?

    Imagine having a pair of identical twins who are experts at spotting similarities and differences. That’s essentially what a Siamese network is – a pair of identical AI systems working together to compare things.

    How it works:

    • Twin AI systems: Two identical AI systems look at two different products.
    • Creating ‘fingerprints’ or ‘embeddings’: Each system creates a unique ‘fingerprint’ of the product it’s looking at.
    • Comparison: These ‘fingerprints’ are then compared to see how similar the products are.

    Architecture

    The architecture of a Siamese network typically consists of three main components: the shared network, the similarity metric, and the contrastive loss function.

    • Shared Network: This is the ‘brain’ that creates the product ‘fingerprints’ or ‘embeddings.’ It is responsible for extracting meaningful feature representations from the input samples. This network is composed of layers of neural units that work together. Weight sharing between the twin networks ensures that the model learns to extract comparable features for similar inputs, providing a basis for comparison.
    • Similarity Metric: After the shared network processes the inputs, a similarity metric is employed. This decides how alike two ‘fingerprints’ or ‘embeddings’ are. The selection of a similarity metric depends on the specific task and characteristics of the input data. Frequently used similarity metrics include the Euclidean distance, cosine similarity, or correlation coefficient, each chosen based on its suitability for the given context and desired outcomes.
    • Loss Function: For training the Siamese network, a specialized loss function is used. This helps the system improve its comparison skills over time. It guides and trains the network to generate akin embeddings for similar inputs and disparate embeddings for dissimilar inputs.

      This is achieved by imposing penalties on the model when the distance or dissimilarity between similar pairs surpasses a designated threshold, or when the distance between dissimilar pairs falls below another predefined threshold. This training strategy ensures that the network becomes adept at discerning and encoding the desired level of similarity or dissimilarity in its learned embeddings.
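
    The sketch below ties these three components together in PyTorch: one shared encoder produces embeddings for both inputs, Euclidean distance serves as the similarity metric, and a contrastive loss pulls similar pairs together and pushes dissimilar pairs apart. The tiny encoder, embedding size, and margin are illustrative choices, not the production architecture.

    ```python
    # Minimal sketch of a Siamese setup: shared encoder, distance metric, contrastive loss.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedEncoder(nn.Module):
        def __init__(self, embed_dim: int = 128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Flatten(),
                nn.Linear(3 * 224 * 224, 512), nn.ReLU(),
                nn.Linear(512, embed_dim),
            )

        def forward(self, x):
            # The embedding ("fingerprint") for one input
            return F.normalize(self.net(x), dim=-1)

    def contrastive_loss(z1, z2, label, margin=1.0):
        """label = 1 for similar pairs, 0 for dissimilar pairs."""
        dist = F.pairwise_distance(z1, z2)                  # similarity metric
        pos = label * dist.pow(2)                           # pull similar pairs together
        neg = (1 - label) * F.relu(margin - dist).pow(2)    # push dissimilar pairs apart
        return (pos + neg).mean()

    encoder = SharedEncoder()  # the *same* weights process both inputs (weight sharing)
    x1, x2 = torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224)
    labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
    loss = contrastive_loss(encoder(x1), encoder(x2), labels)
    ```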

    How DataWeave Uses Siamese Networks for Product Matching

    At DataWeave, we use Siamese Networks to match products across different retailer websites. Here’s how it works:

    Pre-processing (Image Preparation)

    • We collect product images from various websites.
    • We clean these images up to make them easier for our AI to understand.
    • We use techniques like cropping, flipping, and adjusting colors to help our AI recognize products even if the images are slightly different.
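
    A minimal torchvision pipeline along these lines might look like the following; the specific parameters are assumptions for illustration rather than production settings.

    ```python
    # Illustrative pre-processing: cropping, flipping, and colour adjustments so the
    # model tolerates small differences between images of the same product.
    from torchvision import transforms

    train_transforms = transforms.Compose([
        transforms.RandomResizedCrop(224),                                     # cropping
        transforms.RandomHorizontalFlip(p=0.5),                                # flipping
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # colour adjustment
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    ```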

    Training The AI

    • We show our AI system millions of product images, teaching it to recognize similarities and differences.
    • We use a special learning method called “Triplet Loss” to help our AI understand which products are the same and which are different.
    • We’ve tested different AI structures to find the one that works best for product matching, including ResNet, EfficientNet, NFNet, and ViT. 
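
    As an illustration, a single training step with triplet loss on one of the backbones mentioned above (ResNet is used here purely as an example) could look like this sketch; sizes and hyperparameters are assumptions.

    ```python
    # Sketch of one triplet-loss training step: the anchor should end up closer to the
    # positive (same product) than to the negative (different product).
    import torch
    import torch.nn as nn
    from torchvision import models

    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = nn.Linear(backbone.fc.in_features, 128)   # 128-d embedding head

    criterion = nn.TripletMarginLoss(margin=1.0)
    optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)

    def training_step(anchor, positive, negative):
        optimizer.zero_grad()
        loss = criterion(backbone(anchor), backbone(positive), backbone(negative))
        loss.backward()
        optimizer.step()
        return loss.item()
    ```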

    Image Retrieval 

    • Once trained, our AI creates a unique “fingerprint” for each product image.
    • We store these fingerprints in a smart database.
    • When we need to find a match for a product, we:
      • Create a fingerprint for the new product.
      • Quickly search our database for the most similar fingerprints.
      • Return the top matching products.
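
    The retrieval step can be sketched with a vector index such as FAISS, used here only as an example of the kind of “smart database” described; the index type, catalog size, and number of results are illustrative.

    ```python
    # Sketch of fingerprint storage and nearest-neighbour lookup.
    import faiss
    import numpy as np

    dim = 128
    index = faiss.IndexFlatIP(dim)      # inner product == cosine similarity on normalised vectors

    catalog = np.random.rand(1_000_000, dim).astype("float32")   # stored product "fingerprints"
    faiss.normalize_L2(catalog)
    index.add(catalog)

    query = np.random.rand(1, dim).astype("float32")             # fingerprint of the new product
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 15)                        # top matching products
    ```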

    Matches are then assigned a high or a low similarity score and segregated into “Exact Matches” or “Similar Matches.” For example, check out the image of this white shoe on the left. It has a low similarity score with the pink shoe (below) and so these SKUs are categorized as a “Similar Match.” Meanwhile, the shoe on the right is categorized as an “Exact Match.”

    Similarly, in the following image of the dress for a young girl, the matched SKU has a high similarity score and so this pair is categorized as an “Exact Match.”

    Siamese Networks play a pivotal role in DataWeave’s Product Matching Engine. Amid the millions of images and product descriptions online, our Siamese Networks act as an equalizing force, efficiently narrowing down millions of candidates to a curated selection of 10-15 potential matches. 

    In addition, these networks also find application in several other contexts at DataWeave. They are used to train our system to understand text-only data from product titles and joint multimodal content from product descriptions.

    Leverage Our AI-Driven Product Matching To Get Insightful Data

    In summary, accurate and efficient product matching is no longer a luxury – it’s a necessity. DataWeave’s advanced product matching solution provides brands and retailers with the tools they need to navigate this complex landscape, turning the challenge of product matching into a competitive advantage.

    By leveraging cutting-edge technology and simplifying it for practical use, we empower businesses to make informed decisions, optimize their operations, and stay ahead in the ever-evolving eCommerce market. To learn more, reach out to us today!

  • AI-powered Product Matching: The Key to Competitive Pricing Intelligence in eCommerce

    AI-powered Product Matching: The Key to Competitive Pricing Intelligence in eCommerce

    With thousands of products and hundreds of online retailers to choose from, the average modern-day shopper usually compares prices across several e-commerce sites effortlessly before often settling for the lowest priced option. As a result, retailers today are forced to execute millions of price changes per day in a never-ending race to be the lowest priced – without losing out on any potential margin.

    Identifying, classifying, and matching products is the first step to comparing prices across websites. However, there is no standardization in the way products are represented across e-commerce websites, causing this process to be fairly complex.

    Here’s an example:

    What’s needed is a pricing intelligence solution that first matches products across several websites swiftly and accurately, and then enables automated tracking of competitor pricing data on an ongoing basis.

    Pricing intelligence solutions already exist. What’s wrong with using them?

    There are several challenges with the incumbent solutions in the market – the biggest one being that they don’t work in a timely manner. In essence, it’s like deferring the process of finding actionable information that helps retailers acquire a competitive advantage, and instead doing it in hindsight. Like an autopsy of sorts.

    Here are the various solution types we have in the market today:

    • Internally developed systems – Solutions developed by retailers themselves often rely on heavy manual data aggregation and have poor product matching capabilities. Since these solutions have been developed by professionals not attuned to building data crunching machines, they pose significant operational challenges in the form of maintenance, updates, etc.
    • Web scraping solutions – These solutions have no data normalization or product matching capabilities, and lack the power to deliver relevant actionable insights. What’s more, it’s a struggle to scale them up to accommodate massive volumes of data during peak times such as promotional campaigns.
    • DIY solutions – These solutions require manual research and entry of data. It goes without saying that due to the level of human intervention and effort required, they’re expensive, difficult to scale, slow, and of questionable accuracy.

    As common as it is nowadays, AI has the answer

    DataWeave’s competitive pricing intelligence solution is designed to help retailers achieve precisely the competitive advantage they need by providing them with accurate, timely, and actionable pricing insights enabled by matching products at scale. We provide retailers with access to detailed pricing information on millions of products across competitors, as frequently as they need it.

    Our technology stack broadly consists of the following.

    1. Data Aggregation

    At DataWeave, we can aggregate data from diverse web sources across complex web environments – consistently and at a very high accuracy. Having been in the industry for close to a decade, we’re sitting on a lot of data that we can use to train our product matching platform.

    Our datasets include data points from tens of millions of products and have been collected from numerous geographies and verticals in retail. The datasets contain hierarchically arranged information based on retail taxonomy. At the root level, there’s information such as category and subcategory, and at the leaf level, we have product details such as title, description, and other <attribute, value> relationships. Our machine learning architectures and semi-automated training data building systems, augmented by the skills of a strong QA team, help us annotate the necessary information and create labeled datasets using proprietary tools.

    2. AI for Product Matching

    Product matching at DataWeave is done via a unified platform that uses both text and image recognition capabilities to accurately identify similar SKUs across thousands of e-commerce stores and millions of products. We use an ensemble of deep learning architectures tailored to NLP and Computer Vision problems specific to us, along with heuristics pertinent to the Retail domain. Products are also classified based on their features, and a normalization layer is designed based on various text/image-based attributes.

    Our semantics layer, while technically an integral part of the product matching process, deserves particular mention due to its powerful capabilities.

    The text data processing consists of internal, deeply pre-trained word embeddings. We use state-of-the-art, customized word representation techniques such as ELMo, BERT, and Transformer-based models to capture deeply contextualized text with improved accuracy. A self-attention (intra-attention) mechanism learns the correlation between the word in question and the preceding parts of the description.
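
    As a simplified illustration of such contextualized representations (using an off-the-shelf BERT model as a stand-in for the customized in-house ones), product titles can be embedded as follows.

    ```python
    # Illustrative: embed product titles with a pre-trained transformer so that
    # self-attention mixes in context from the whole title.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    titles = ["Apple iPhone 13 128GB Pink", "iPhone 13 (128 GB) - Pink"]
    inputs = tokenizer(titles, padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state

    # Mean-pool token vectors into one contextual embedding per title
    mask = inputs["attention_mask"].unsqueeze(-1)
    title_embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    ```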

    Image data processing starts with object detection to identify the region of interest of a given product (for example, the upper body of a fashion model displaying a shirt). We then leverage deep learning architectures such as VGGNet, Inception-V3, and ResNet, which we have trained using millions of labeled images. Next, we apply multiple pre-processing techniques such as variable background removal, face removal, skin removal, and image quality enhancement, and extract image signatures via deep learning and machine learning-based algorithms to uniquely identify products across billions of indexed products.

    Finally, we efficiently distribute billions of images across multiple stores for fast access, and to facilitate searches at a massive scale (in a matter of milliseconds, without the slightest compromise on accuracy) using our image matching engine.

    3. Human Intelligence in the Loop

    In scenarios where the confidence scores of the machine-driven matches are low, we have a team of Quality Assurance (QA) specialists who verify the output.

    This team does three things:

    • Find out why the confidence score is low
    • Confirm the right product matches
    • Figure out a way to encode this knowledge into a rule and feed it back to the algorithm

    In this way, we’ve built a self-improving feedback loop which, by its very nature, performs better over time. This system has accumulated knowledge over the 8 years of our operations, which is going to be hard for anyone to replicate. Essentially, this process enables us to match products at massive scale quickly and at very high levels of accuracy (usually over 95%).

    4. Actionable Insights Via Data Visualization

    Once the matching process is completed, the prices are aggregated at any frequency, enabling retailers to optimize their prices on an ongoing basis. Pricing insights are typically consumed via our SaaS-based web-portal, which consists of dashboards, reports, and visualizations.

    Alternatively, we can integrate with internal analytics platforms through APIs or generate and deliver spreadsheet reports on a regular basis, depending on the preferences of our customers.

    To summarize

    The benefits of our solution are many. Detailed, timely insights into price improvement opportunities empower retailers to significantly enhance their competitive positioning across categories, product types, and brands, as well as their ability to influence their price perception among consumers. These insights, when leveraged at a higher granularity over the long term, can help maximize revenue through price optimization at a large scale.

    Our solution also helps drive process-based as well as operational optimizations for retailers. Such modifications help them better align themselves to effectively adopt a data-driven approach to pricing, in turn helping them achieve much smarter retail operations across the board.

    All of this wouldn’t be possible if the product matching process, inherent to this system, was unreliable, expensive, or time-consuming.

    If you would like to learn more about DataWeave’s proprietary product matching platform and the benefits it offers to eCommerce businesses and brands, talk to us now!

  • AI-Driven Mapping of Retail Taxonomies- Part 2

    AI-Driven Mapping of Retail Taxonomies- Part 2

    Mapping product taxonomies using Deep Learning

    In Part 1 we discussed the importance of Retail taxonomy and the applications of mapping retail taxonomies in Assortment Analytics, building Knowledge Graph, etc. Here, we will discuss how we approached the problem of mapping retail taxonomies across sources.

    We solved this problem by classifying every retail product to a standard DataWeave-defined taxonomy so that products from different websites could be brought to the same level. Once these products are at the same level, mapping taxonomies becomes straightforward.

    We’ve built an AI-based solution that uses state-of-the-art algorithms to predict the correct DataWeave Taxonomy for a product from its textual information like Title, Taxonomy and Description. Our model predicts a standard 4 level (L1-L2-L3-L4) taxonomy for any given product. These Levels denote Category, Sub Category, Parent Product Type and Product Type respectively.

    Approach

    Conventional methods for taxonomy prediction are typically based on machine learning classification algorithms. Here, we need to provide textual data and the classifier will predict the entire taxonomy as a class.

    We used the classification approach as a baseline, but found a few inherent flaws in this:

    • A classification model cannot understand the semantic relation between the input text and the output hierarchy, which means it cannot tell whether there is any relation between the textual input and the text present in the taxonomy. For a classifier, the output class is just a label-encoded value
    • Since the taxonomy is a tree and each leaf node uniquely defines a path from the root to the leaf, classification algorithms effectively output an existing root-to-leaf path. However, they cannot predict new relationships in the tree structure
    • Let’s say our training set has only records for “Clothing, Shoes & Jewelry > Men > Clothing > Shorts” and “Clothing, Shoes & Jewelry > Baby > Shoes > Boots”. Example:

    {"title": "Russell Athletic Men’s Cotton Baseline Short with Pockets – Black – XXX-Large",
    "dw_taxonomy": "Clothing, Shoes & Jewelry > Men > Clothing > Shorts"},

    {"title": "Surprise by Stride Rite Baby Boys Branly Faux-Leather Ankle Boots (Infant/Toddler) – Brown",
    "dw_taxonomy": "Clothing, Shoes & Jewelry > Baby > Shoes > Boots"}

    Now, if a product with the title “Burt’s Bees Baby Baby Boys’ Terry Short” comes in for prediction, the classifier will never be able to predict the correct taxonomy, even though it has seen data points for both Shorts and Baby.

    E-commerce product taxonomy also has a very long tail, i.e., there is a huge imbalance in the amount of data per taxonomy. Classification algorithms do not perform well on such long-tail problems.

    Encoder-Decoder with Attention for Taxonomy Classification

    What is Encoder-Decoder?

    Encoder-Decoder is a classical Deep Learning architecture where there are two Deep Neural Nets, an Encoder and a Decoder linked with each other to generate desired outputs.

    The objective of an Encoder is to encode the required information from the input data and store it in a feature vector. In the case of text input, the encoder is usually an RNN- or Transformer-based architecture, and for image input it is usually a CNN-based architecture. Once the encoded feature vector is created, the Decoder uses it to produce the required output. The Encoder and Decoder can be interfaced by another layer called Attention. The role of the Attention mechanism is to train the model to selectively focus on useful parts of the input data and thus learn the alignment between input and output. This helps the model cope effectively with long input sentences (when dealing with text) or complex portions of images (when the input is an image).

    Instead of classification-based approaches, we use an Encoder-Decoder architecture and map the problem of taxonomy classification to the task of machine translation (MT), also known as Seq2Seq. An MT system takes text in one language as input and outputs its translation as a sequence of words in another language. In our case, the input maps to the textual description of a product, and the output maps to the sequence of categories and sub-categories in our taxonomy (e.g., Clothing, Shoes & Jewelry > Baby > Shoes > Boots). By framing taxonomy classification as an MT problem, we overcome many of the limitations present in classical classification approaches.

    • This architecture has the capability to predict a taxonomy that is not even present in the training data.
      • Returning to the example we discussed earlier, where a traditional classification model was not able to predict the taxonomy for “Baby Boys knit terry shorts – cat & jack gray 12 m”, this Encoder-Decoder model easily predicts the correct taxonomy as “Clothing, Shoes & Jewelry > Baby > Clothing > Shorts”
    • We achieved a much higher accuracy because the model understands the semantic relationship between the input and output text, as well as giving attention to the most relevant parts in the input, when generating the output

    Fig. Attention visualization for the product title “South of France Lavender Fields Bar Soap”. The attention weight for the word “soap” is very high when predicting the output at different time-steps.

    We used pre-trained fastText word embeddings to vectorize the textual input, which is passed to a GRU-RNN-based encoder that processes the input sequentially and generates the final encoded vector. The Decoder, also a GRU-RNN, takes this encoded input and generates the output sequentially. Along with the encoded vector, an attention vector is passed to the Decoder for the output at every time-step.
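
    A compact PyTorch sketch of this architecture is shown below: fastText-style input vectors feed a GRU encoder, and a GRU decoder with dot-product attention emits the taxonomy one token at a time. Dimensions and the attention variant are illustrative assumptions, not the exact in-house model.

    ```python
    # Sketch of a GRU encoder-decoder with attention for Seq2Seq taxonomy prediction.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Encoder(nn.Module):
        def __init__(self, emb_dim=300, hid=256):
            super().__init__()
            self.gru = nn.GRU(emb_dim, hid, batch_first=True)

        def forward(self, word_vectors):                    # (batch, src_len, emb_dim)
            outputs, hidden = self.gru(word_vectors)
            return outputs, hidden                          # per-word states + final encoded vector

    class AttnDecoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=128, hid=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.gru = nn.GRU(emb_dim + hid, hid, batch_first=True)
            self.out = nn.Linear(hid, vocab_size)

        def forward(self, prev_token, hidden, encoder_outputs):
            emb = self.embed(prev_token).unsqueeze(1)                          # (batch, 1, emb)
            scores = torch.bmm(encoder_outputs, hidden[-1].unsqueeze(2))       # dot-product attention
            weights = F.softmax(scores, dim=1)                                 # what the figure visualises
            context = (weights * encoder_outputs).sum(dim=1, keepdim=True)
            output, hidden = self.gru(torch.cat([emb, context], dim=-1), hidden)
            return self.out(output.squeeze(1)), hidden, weights
    ```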

    We trained both the Classification model (Baseline) and the Encoder-Decoder model for the Fashion category and the Beauty & Personal Care category. 

    For Fashion, we trained the model with 170,000 data points and validated it on a 30k set. For Beauty & Personal Care, we trained the model on 88k data points and validated it on a 20k set. Using the Encoder-Decoder approach, we achieved 92% Seq2Seq accuracy across 1,240 classes for the Fashion category and 96% Seq2Seq accuracy across 343 classes for the Beauty category.

    Summary and the Way Forward

    Since we moved to this approach, we have seen drastic improvements in the accuracy of our Assortment Intelligence accounts. But the road doesn’t end here. There are several challenges to be tackled and worked upon. We’re planning on making this process language agnostic by using cross-lingual embeddings, merging models from different categories and also using product Image to complement the text-based model with visual input via a Multi-Modal approach.

    References

    Don’t Classify, Translate: Multi-Level E-Commerce Product Categorization Via Machine Translation by Maggie Yundi Li, Stanley Kok and Liling Tan

    SIGIR eCom’18 Data Challenge is organized by Rakuten Institute of Technology Boston (RIT-Boston)

    Massive Exploration of Neural Machine Translation Architectures by Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le

  • Mapping eCommerce Product Taxonomy with AI Pt. 1

    Mapping eCommerce Product Taxonomy with AI Pt. 1

    Product Taxonomy and its importance in retail

    Every product on a retail website is categorized in such a way that it denotes where the product belongs in the entire catalog. Generally, these categorizations follow a hierarchy that puts the product under some Category, Subcategory, and Product Type (e.g., Clothing, Shoes & Jewelry > Men > Clothing > Shirts). We call this hierarchical eCommerce product categorization the Product Taxonomy. Categorizing products in a logical manner – in a way a shopper would find intuitive – helps with navigation when browsing an e-commerce website.

    In addition, with a good category organization, a product lends itself to better searchability (by search engines) on e-commerce websites. Search engines work by looking up query terms in an index that points to products containing those terms. Matches in various fields are ranked differently in relevance.

    For instance, a term that matches a word in the title indicates greater relevance compared to one that matches the description. Additionally, terms that are exclusive to certain products signal greater selectivity and hence contribute more to ranking. In light of this, the choice of words in the fields indicating a product’s category affects the relevance of search results for a user query. This improves discoverability and, as relevant results show up, in turn improves the user experience. A good product taxonomy contributes to increased sales by helping shoppers find relevant products while browsing or searching.

    Retail websites organize products into a taxonomy which they deem intuitive for their users, and fits the organization of their business units. Different retail websites could thus have taxonomies varying significantly from each other. Since we deal with millions of products across hundreds of websites on a daily basis, we often have to work with various taxonomies for the same product coming from different websites. 

    We are required to align these to a common standard taxonomy for our analyses. Standard taxonomies like Global Product Classification (GPC) taxonomies and Google Product taxonomies offer a standard way of representing a product. However, none of these taxonomies are complete and generic. Hence, we at DataWeave have come up with our own Standard Taxonomies for each category in e-commerce, which are generic enough to represent products on websites across different geographies.

    Having a standard taxonomy for each retail product is important for our Data Orchestration pipeline. A Standard Taxonomy helps in enriching the DataWeave Retail Knowledge Graph at scale.

    DataWeave’s Retail Knowledge Graph

    The information about products on most of the retail websites is unstructured and broken. We process this unstructured data, derive structured information from it and store it in a connected format in our Knowledge Graph. The Knowledge Graph is used in downstream applications like Attribute Tagging, Content Analysis, etc. The Knowledge Graph follows a standard hierarchy of 4 levels  (L1 > L2 > L3 > L4) for all the retail products.

    Mapping eCommerce retail taxonomies is not only a requirement for the Knowledge Graph, but has some direct business applications as well:

    Assortment Analytics

    • Mapping competitors’ products to their own taxonomies helps retailers understand the exact gap in their assortment, regardless of how competitors categorize their products
    • Let’s say a retailer is interested in knowing the assortment of a product type, Scented Candles in their competitor’s catalog. Now, the retailer might have categorized it as Home & Kitchen > Home Decor > Scented Candle but the same product type could have been categorized as Fragrance > Home > Candles on a competitor’s website. Here, having an efficient and scalable mechanism to map product taxonomies provides accurate assortment analytics which retailers look for. Example:

    Health & Household > Health Care > Alternative Medicine > Aromatherapy > Candles

    Fragrance > Candles & Home Scents > Candles

    Automated Catalog Suggestion

    It is also used in Catalog Suggestion as a Service, where for any product we suggest the appropriate taxonomy it should follow on the website for a better browsing experience.

    Stay tuned for Part 2 to learn how we solve the problem of mapping various retail taxonomies.

    Click here to know more about assortment analytics

  • Market Intelligence Platform with Kenshoo

    Market Intelligence Platform with Kenshoo

    We’re thrilled to announce that we have teamed up with Kenshoo to offer an integrated marketing solution that combines DataWeave’s digital shelf analytics and commerce intelligence platform with Kenshoo’s ad automation platform. This in turn, provides better recommendations on promotions to retailers and consumer brands.

    As e-commerce surges, consumer brands can now promote their products through retail-intelligent advertising. Product discoverability, content audit, and availability across large marketplaces can be critical to a brand’s success. Using DataWeave’s digital shelf solutions, Kenshoo now can offer marketers greater visibility into a brand’s performance.

    Even large retailers and agencies can use our commerce intelligence platform to improve their price positioning, address category assortment gaps, and more.  

    Through this partnership, Kenshoo – a global leader in marketing technology – can help its significant base of consumer brands and retailers invest their marketing dollars intelligently and in a timely manner.

    At DataWeave, we have constantly strived to bring in a holistic approach to help our customers optimize their online sales channels. This partnership furthers our resolve in this direction. As we collectively strive to adjust to a post-COVID-19 world, we are observing an acceleration towards digital commerce. This acceleration and change in consumer behavior is going to be a lasting change, creating significant growth opportunities for both DataWeave and Kenshoo.

    With this partnership, we look forward to helping our customers make timely, intelligent, and data-driven decisions to grow their business.

  • How Apache Airflow Optimizes Complex Workflows in DataWeave’s Technology Platform

    How Apache Airflow Optimizes Complex Workflows in DataWeave’s Technology Platform

    As successful businesses grow, they add a large number of people, customers, tools, technologies, etc., and roll out processes to manage the ever more complex landscape. Automation ensures that these processes run in a smooth, efficient, swift, accurate, and cost-effective manner. To this end, Workflow Management Systems (WMS) aid businesses in rolling out an automated platform that manages and optimizes business processes at large scale.

    While workflow management, in itself, is a fairly intricate undertaking, the eventual improvements in productivity and effectiveness far outweigh the effort and costs.

    At DataWeave, on a normal day, we collect, process and generate business insights on terabytes of data for our retail and brand customers. Our core data pipeline ensures consistent data availability for all downstream processes including our proprietary AI/ ML layer. While the data pipeline itself is generic and serves standard workflows, there has been a steady surge in customer-specific use case complexities and the variety of product offerings over the last few years.

    A few months ago, we recognized the need for an orchestration engine. This engine would serve to manage the vast volumes of data received from customers, capture data from competitor websites (which range widely in complexity and from 2 to 130+ in number), run the required data transformations, execute the product matching algorithm through our AI systems, process the output through a layer of human verification, generate actionable business insights, feed the insights to reports and dashboards, and more. In addition, this engine would be required to help us manage the diverse customer use cases in a consistent way.

    As a result, we launched a hunt for a suitable WMS. We needed the system to satisfy several criteria:

    • Ability to manage our complex pipeline, which has several integrations and tech dependencies
    • Extendable system that enables us to operate with multiple types of databases, internal apps, utilities, and APIs
    • Plug and play interfaces to execute custom scripts, and QA options at each step
    • Operates with all cloud services
    • Addresses the needs of both ‘Batch’ and ‘Near Real Time’ processes
    • Generates meaningful feedback and stats at every step of the workflow
    • Helps us do away with numerous crontabs, which are hard to manage
    • Execute workflows repeatedly in a consistent and precise manner
    • Ability to combine multiple complex workflows and conditional branching of workflows
    • Provides integrations with our internal project tracking and messaging tools such as, Slack and Jira, for immediate visibility and escalations
    • A fallback mechanism at each step, in case of any subsystem failures.
    • Fits within our existing landscape and doesn’t mandate significant alterations
    • Should support autoscaling since we have varying workloads (the system should scale the worker nodes on-demand)

    On evaluating several WMS providers, we zeroed in on Apache Airflow. Airflow satisfies most of our needs mentioned above, and we’ve already onboarded tens of enterprise customer workflows onto the platform.

    In the following sections, we will cover our Apache Airflow implementation and some of the best practices associated with it.

    DataWeave’s Implementation

    Components

    Broker: A 3-node RabbitMQ cluster for high availability. Two separate queues are maintained, one for SubDAGs and one for tasks, since SubDAGs are very lightweight processes: while they occupy a worker slot, they don’t do any meaningful work apart from waiting for their tasks to complete.

    Meta-DB: MetaDB is one of the most crucial components of Airflow. We use RDS-MySQL for the managed database.

    Controller: The controller consists of the scheduler, web server, file server, and the canary dag. This is hosted in a public subnet.

    Scheduler and Webserver: The scheduler and webserver are part of the standard airflow services.

    File Server: Nginx is used as a file server to serve airflow logs and application logs.

    Canary DAG: The canary DAG mimics the actual load on our workers. It runs every 30 minutes and checks the health of the scheduler and the workers. If the task is not queued at all or has spent more time in the queued state than expected, then either the scheduler or the worker is not functioning as expected. This will trigger an alert.
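
    A minimal sketch of a DAG in this spirit is shown below: a trivial task scheduled every 30 minutes. The import paths, timeout, and alerting details are illustrative (Airflow 2.x style), not the exact production canary DAG.

    ```python
    # Sketch of a canary DAG: if this heartbeat never runs, or sits in the queue
    # too long, the scheduler or the workers are unhealthy and an alert should fire.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def heartbeat():
        print("canary ok")

    with DAG(
        dag_id="canary",
        start_date=datetime(2021, 1, 1),
        schedule_interval=timedelta(minutes=30),
        catchup=False,
        default_args={"execution_timeout": timedelta(minutes=5), "retries": 0},
    ) as dag:
        PythonOperator(task_id="heartbeat", python_callable=heartbeat)
    ```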

    Workers: The workers are placed in a private subnet. A general-purpose AWS machine with two types of workers is configured, one for sub-DAGs and one for tasks. The workers are placed in an EC2-Autoscaling group and the size of the group will either grow or shrink depending on the current tasks that are executed.

    Autoscaling of workers

    Increasing the group size: A Lambda function is triggered at periodic intervals and checks the length of the RMQ queue. The Lambda also knows the current number of workers in the fleet. Along with that, we log the average run time of tasks in the DAG. Based on these parameters, we either increase or decrease the group size of the cluster.

    Reducing the group size: When we decrease the number of workers, any of the workers can be taken down, and each worker needs to be able to handle this gracefully. This is done through termination hooks. We follow an aggressive scale-up policy and a conservative scale-down policy.
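
    The scaling decision itself can be sketched as below: read the queue depth from the RabbitMQ management API, compare it with the current worker count, and resize the autoscaling group. The URL, credentials, thresholds, and group name are placeholders, not our actual configuration.

    ```python
    # Sketch of a periodic scaling decision for the worker autoscaling group.
    import boto3
    import requests

    RMQ_QUEUE_API = "http://rabbitmq.internal:15672/api/queues/%2F/airflow_tasks"  # placeholder
    ASG_NAME = "airflow-workers"                                                   # placeholder

    def lambda_handler(event, context):
        queued = requests.get(RMQ_QUEUE_API, auth=("user", "pass")).json()["messages"]

        asg = boto3.client("autoscaling")
        group = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]
        current = group["DesiredCapacity"]

        # Aggressive scale-up, conservative scale-down
        if queued > current * 10:
            desired = min(current + 2, group["MaxSize"])
        elif queued == 0 and current > group["MinSize"]:
            desired = current - 1
        else:
            return

        asg.set_desired_capacity(AutoScalingGroupName=ASG_NAME, DesiredCapacity=desired)
    ```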

    File System: We use EFS (Elastic File System) of AWS as the file system that is shared between the workers and the controller. EFS is a managed NAS that can be mounted on multiple services. By using EFS, we have ensured that all the logs are present in one file system and these logs are accessible from the file server present in the controller. We have put in place a lifecycle policy on EFS to archive data older than 7 days.

    Interfaces: To scale up the computing platform when required, we have a bunch of hooks, libraries, and operators to interact with external systems like Slack, EMR, Jira, S3, Qubole, Athena, and DynamoDB. Standard interfaces like Jira and Slack also help in onboarding the L2 support team. The L2 support relies on Jira and Slack notifications to monitor the DAG progress.

    Deployment

    Deployment in an airflow system is fairly challenging and involves multi-stage deployments.

    Challenges:

    • If we first deploy the controller and if there are any changes in the DAG, the corresponding tasks may not be present in workers. This may lead to a failure.
    • We have to make blue-green deployments as we cannot deploy on the workers where tasks may still be running. Once the worker deployments are done, the controller deployment takes place. If it fails for any reason, both the deployments will be rolled back.

    We use AWS CodeDeploy to perform these activities.

    Staging and Development

    For development, we use a Docker container from Puckel-Airflow. We have made certain modifications to change the user_id and to run multiple Docker containers on the same system. This helps us test all new functionality at a DAG level.

    The staging environment is exactly like the development environment, wherein we have isolated our entire setup in separate VPCs, IAM policies, S3-Buckets, Athena DBs, Meta-DBs, etc. This is done to ensure the staging environment doesn’t interfere with our production systems. The staging setup is also used to test the infra-level changes like autoscaling policy, SLAs, etc.

    In Summary

    Following the deployment of Apache Airflow, we have onboarded several enterprise customers across our product suite and seen up to a 4X improvement in productivity, consistency and efficiency. We have also built a sufficient set of common libraries, connectors, and validation rules over time, which takes care of most of our custom, customer-specific needs. This has enabled us to roll out our solutions much faster and with better ROI.
    As Airflow has been integrated with our communications and project tracking systems, we now have much faster and better visibility into current statuses, issues with sub-processes, and duration-based automation for escalations.
    At the heart of all the benefits we’ve derived is the fact that we have now achieved much higher consistency in processing large volumes of diverse data, which is one of DataWeave’s key differentiators.
    In subsequent blog posts, we will dive deeper into specific areas of this architecture to provide more details. Stay tuned!

  • Flaunt Your Deep-Tech Prowess at Bootstrap Paradox Hackathon Hosted by Blume Ventures

    Flaunt Your Deep-Tech Prowess at Bootstrap Paradox Hackathon Hosted by Blume Ventures

    When DataWeave was founded in 2011, we set out to democratize data by enabling businesses to leverage public Web data to solve mission-critical business problems. Eight years on, we have done just that, and grown to deliver AI-powered competitive intelligence and digital shelf analytics to several global retailers and brands, which include the likes of Adidas, QVC, Overstock, Sauder, Dorel, and more.

    As the company has grown, so has our team, which is now 140+ members strong. We’re still constantly on the lookout for smart, open, and driven folks to join us and contribute to our success.

    And so, we’re excited to partner with Skillenza and Blume Ventures to co-host the Bootstrap Paradox Hackathon, where we are eager to engage with the developer community and contribute in our own way back to the startup ecosystem.

    The event will be conducted as an offline product building competition, with a duration of 24 hours on August 3-4, 2019 at the Microsoft India office in Bengaluru. It will provide a platform for developers and coders to interact with and solve challenges thrown up by DataWeave and other Blume portfolio companies, such as Dunzo, Unacademy, Milkbasket, Mechmocha, and Locus.

     

     

    Taking up DataWeave’s challenge during this Hackathon will give you a sneak peek into what our team works on daily. It’s no surprise that we have “At DataWeave, it’s a Hackathon every day!” plastered on our walls. After all, it’s not just all about intense work, but also a lot of fun and frolic.

    The problems that we deal with are as exciting as they are hard. Some of our key accomplishments in technology include:

    • Matching products across e-commerce websites at massive scale and at high levels of accuracy and coverage
    • Using Computer Vision to detect product attributes in fashion, such as color, sleeve length, collar type, etc., by analyzing catalog images
    • Aggregating data from complex web environments, including mobile apps, and across 25+ international languages

    One of our more recent innovations has been in optimizing e-commerce product discovery engines, which dramatically improves shopper experience and purchase conversion rates. During the Bootstrap Paradox Hackathon, coders will get a chance to build a similar engine, with guidance and assistance from DataWeave’s technology leaders.

    Data sets containing product information like title, description, image URL, price, category etc. will be provided, and coders will need to clean up the data, extract information on relevant product attributes and features, and index them, in the process of building the product discovery engine.

    For more details on the challenge, register here on the Skillenza platform.

    As a sweetener, the event also promises everyone a chance to win over 10 lakhs in prize money.

    Simply put, if you love code, this is the place to be this weekend. See you there!

  • Compete Profitably in Retail: Leveraging AI-Powered Competitive Intelligence at Massive Scale

    Compete Profitably in Retail: Leveraging AI-Powered Competitive Intelligence at Massive Scale

    AI is everywhere. Any retailer worth their salt knows that in today’s hyper-competitive environment, you can’t win just by fighting hard – you have to win by fighting smart. The solution? Retailers are turning to AI in droves.

    The problem is that many organizations regard AI as a black box of sorts – where you can throw all your data (the digital era’s blessing that feels like a curse) in at one end and have miraculously meaningful output appearing out the other. The reality of how AI works, however, is a lot more complex. It takes a lot of work to make AI work for you – and then to derive value out of it.

    Image Source: https://xkcd.com/1838

    Following the advent of the digital era, businesses across industries, particularly retail, were left grappling with massive amounts of internal data. To make things worse, this data was unstructured and siloed, making it difficult to process effectively. Yet, businesses learned to leverage simple analytics to extract relevant data and insights to effect smarter decisions.

    But just as that happened, the e-commerce revolution stirred things up again. As businesses of all shapes, sizes, and types moved online, they suddenly became a whole lot more vulnerable to other players’ movements than they were just about a decade ago, when buyers rarely visited more than one store before they made a purchase. In other words, retailers are now operating in entire ecosystems – with consumers evaluating a number of retailers before making a purchase, and a disproportionate number of players vying for the same consumer mindshare and share of wallet.

    Thus, external data from the web – the largest source of data known to man at present – is becoming critical to businesses’ ability to compete profitably in the market.

    Competing profitably in the digital era: Can AI help?

    As organizations across industries and geographies increasingly realized that their business decisions were affected by what’s happening around them (such as competitors’ pricing and merchandise decisions), they started shifting away from their excessive obsession with internal data, and began to look for ways to gather external data, integrate it with their internal data, and process it all in its entirety to derive wholesome, meaningful insights.

    Simply put, harnessing external data consistently and on a large scale is the only way for businesses to gain a sustainable competitive advantage in the retail market. And the only way to practically accomplish that is with the help of AI. Many global giants are already doing this – they’re analyzing loads of external data every minute to take smarter decisions.

    That said, though, what you need to know is that all this data, while publicly available and therefore accessible, is massive, unstructured, noisy, scattered, dynamic, and incomplete. There’s no algorithm in the world that can start working on it overnight to churn out valuable insights. AI can only be effective if enormous amounts of training data is constantly fed back into it, coaxing it to get better and more astute each time. However, given the scarcity of readily available training datasets, limited and unreliable access to domain-specific data, and the inconsistent nature of the data itself, a majority of AI initiatives have ended up in a “garbage in, garbage out” loop that they can’t break out of.

    What you need is the perfect storm

    At DataWeave, we understand the challenge of blindly dealing with data at such a daunting scale. We get that what you need is a practical way to apply AI to the abundant web data out there and generate specific, relevant, and actionable insights that enable you to make the right decisions at the right time. That’s why we’ve developed a system that runs on a human-aided-machine-intelligence driven virtuous loop, ensuring better, sharper outcomes each time.

    Our technology platform includes four modules:

    1. Data aggregation: Here, we capture public web data at scale – whatever format, size, or shape it’s in – by deploying a variety of techniques.

    2. AI-driven analytics: Since the gathered data is extremely raw, it’s cleaned, curated, and normalized to remove the noise and prepare it for the AI layer, which then analyzes the data and generates insights.

    3. Human-supervised feedback: Though AI is getting smarter with time, we see that it’s still far from human cognitive capabilities – so we’ve introduced a human in the loop to validate the AI-generated insights, and use this as training data that gets fed back to the AI layer. Essentially, we use human intelligence to make AI smarter.

    4. Data-driven decision-making: Once the data has been analyzed and the insights generated, they can either be used as is to drive decision-making, or integrated with internal data for decision-making at a higher level.

    With intelligent, data-backed decision-making capabilities, you can outperform your competitors

    Understandably, pricing is one of the most popular applications of data analytics in retail. For instance, a leading, US-based online furniture retailer approached us with the mission-critical challenge of pricing products just right to maximize sell-through rates as well as gross margin in a cost-effective and sustainable manner. We matched about 2.5 million SKUs across 75 competitor websites using AI and captured pricing, discounts, and stock status data every day. As a result, we were able to effect up to a 30% average increase in the sales of the products tracked, and up to a 3x increase in their gross margin.

    DataWeave’s powerful AI-driven platform is essentially an engine that can help you aggregate and process external data at scale and in near-real time to manage unavoidably high competition and margin pressures by enabling much sharper business decisions than before. The potential applications for the resulting insights are diverse – ranging from pricing, merchandise optimization, and determination of customer perception, to brand governance and business performance analysis.

    If you’d like to learn more about our unique approach to AI-driven competitive intelligence in retail, reach out to us for a demo today!

  • Recognize Product Attributes with AI-Powered Image Analytics

    Recognize Product Attributes with AI-Powered Image Analytics

    Anna is a fashionista and a merchandise manager at a large fast-fashion retailer. As part of her job, she regularly browses through the Web for the most popular designs and trends in contemporary fashion, so she can augment her product assortment with fresh and fast-moving products.

    She spots a picture on social media of a fashion blogger sporting a mustard colored, full-sleeved, woolen coat, a yellow sweatshirt, purple polyester leggings, and a pair of pink sneakers with laces. She finds that the picture has garnered several thousand “likes” and several hundred “shares”. She also sees that a few other online fashion influencers have blogged about similar styles in coats and shoes being in vogue.

    Anna thinks it’s a good idea to house a selection of similar clothing and accessories for the next few weeks, before the trend dies down.

    But, she is in a bit of a pickle.

    Different brands represent their catalog differently. Some have only minimalistic text-based product categorization, while others are more detailed. The ones that are detailed don’t categorize products in a way that helps her narrow down her consideration set. Product images, too, lack standardization as each brand has its own visual merchandising norms and practices.

    Poring through thousands of products across hundreds of brands, looking for similar products is time-consuming and debilitating for Anna, restricting her ability to spend time on higher-value activities. Luckily, at DataWeave, we’ve come across several merchandise managers facing challenges like hers, and we can help.

    AI-powered product attribute tagging in fashion

    DataWeave’s AI-powered, purpose-built Fashion Tagger automatically assigns labels to attributes of fashion products at great granularity. For example, on processing the image of the blogger described earlier, our algorithm generated the following output.

    Original Image Source: Rockpaperdresses.dk

    Vision beyond the obvious

    Training machines is hard. While modern computers can “see” as well as any human, the difference lies in their lack of ability to perceive or interpret what they see.

    This can be compared to a philistine at a modern art gallery. While he or she could quite easily identify the colors and shapes in the paintings, additional instructions would be needed on how the painting can be interpreted, evaluated, and appreciated.

    While machines haven’t gotten that far yet, our image analytics platform is highly advanced, capable of identifying and interpreting complex patterns and attributes in images of clothing and fashion accessories. Our machines recognize various fashion attributes by processing both image- and associated text-based information available for a product.

    Here’s how it’s done:

    • With a single glance of its surroundings, the human eye can identify and localize each object within its field of view. We train our machines to mimic this capability using neural-network-based object detection and segmentation. As a result, our system is sensitive to varied backgrounds, human poses, skin exposure levels, and more, which are quite common for images in fashion retail.
    • The image is then converted to 0s and 1s, and fed into our home-brewed convolutional neural network trained on millions of images with several variations. These images were acquired from diverse sources on the Web, such as user-generated content (UGC), social media, fashion shows, and hundreds of eCommerce websites around the world.
    • If present, text-based information associated with images, like the product title, metadata, and product descriptions, is used to enhance the accuracy of the output and leverage non-visual cues for the product, like the type of fabric. Natural-language processing, normalization, and several other text processing techniques are applied here. In these scenarios, the text and image pipelines are merged based on assigned weightages and priorities to generate the final list of product attributes, as sketched after this list.
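
    A simplified sketch of that weighted merge, assuming each pipeline already emits per-attribute confidence scores; the weightages and attribute names here are illustrative, not our production values.

    IMAGE_WEIGHT, TEXT_WEIGHT = 0.7, 0.3  # illustrative weightages

    def merge_attributes(image_preds, text_preds):
        # e.g. image_preds = {"sleeve_length": {"full": 0.8, "half": 0.2}}
        merged = {}
        for attribute in set(image_preds) | set(text_preds):
            img = image_preds.get(attribute, {})
            txt = text_preds.get(attribute, {})
            # Weighted combination of the two pipelines for every candidate value
            scores = {value: IMAGE_WEIGHT * img.get(value, 0.0) + TEXT_WEIGHT * txt.get(value, 0.0)
                      for value in set(img) | set(txt)}
            # The highest-confidence value wins for each attribute
            merged[attribute] = max(scores, key=scores.get)
        return merged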

    The Technology Pipeline

    Our Fashion Tagger can process most clothing types in fashion retail, including casual wear, sportswear, footwear, bags, sunglasses and other accessories. The complete catalog of clothing types we support is indicated in the image below.

    Product Types Processed and Classified by DataWeave

    One product, several solutions

    Across the globe, our customers in fast-fashion wield our technology every day to compare their product assortment against their competitors. Our SaaS-based portal provides highly granular product-attribute-wise comparisons and tracking of competitors’ products, enabling our customers to spot assortment gaps of in-demand and trending products, as well as to better capitalize on the strengths in their assortment.

     

    Some other popular use cases include:

    • Similar product recommendations: This intelligent product recommendation engine can help retailers identify and recommend to their shoppers products with similar attributes to the one they’re looking at, which can potentially help drive higher sales. For example, they can recommend alternatives to out-of-stock products, so customers don’t bounce off their website easily.
    • Ensemble recommendations: Our proprietary machine-learning based algorithms analyze images on credible fashion blogs and websites to learn the trendiest combinations of products worn by online influencers, helping retailers recommend complementary products and drive more value. Combining this with insights on customer behavior can generate personalized ensemble recommendations. It’s almost like providing a personal stylist for shoppers!
    • Diverse styling options: The same outfit can often be worn in several different ways, and shoppers typically like to experiment with unconventional modes of styling. Our technology helps retailers create “lookbooks” that provide real world examples of multiple ways a particular piece of clothing can be worn, adding another layer to the customer’s shopping experience.
    • Search by image: Shoppers can search for products similar to ones worn by celebrities and other influencers through an option to “Search by Image”, which is possible due to our technology’s ability to automatically identify product attributes and find similar matches.
    • Fast-fashion trend analysis: Retailers can study emerging trends in fashion and host them in their product assortment before anyone else.

    The devil is in the details

    DataWeave’s Fashion Tagger guarantees very high levels of accuracy. Our unique human-in-the-loop approach combines the power of machine-learning-based algorithms with human intelligence to accurately differentiate between similar product attributes, such as between boat, scoop and round necks in T-shirts.

    This system is a closed feedback loop, in which a large amount of ground-truth (manually verified) data is generated by in-house teams, which power the algorithms. In this way, the machine-generated output gets more and more accurate with time, which goes a long way in our ability to swiftly deliver insights at massive scale.

    In summary, DataWeave’s Image Analytics platform is driven by: enormous amount of training data + algorithms + infrastructure + humans-in-loop.

    If you’re intrigued by DataWeave’s technology and wish to know more about how we help fashion retailers compete more effectively, check us out on our website!

     

  • Dataweave – CherryPy vs Sanic: Which Python API Framework is Faster?

    Dataweave – CherryPy vs Sanic: Which Python API Framework is Faster?

    Rest APIs play a crucial role in the exchange of data between internal systems of an enterprise, or when connecting with external services.

    When an organization relies on APIs to deliver a service to its clients, the APIs’ performance is crucial, and can make or break the success of the service. It is, therefore, essential to consider and choose an appropriate API framework during the design phase of development. Benefits of choosing the right API framework include the ability to deploy applications at scale, ensuring agility of performance, and future-proofing front-end technologies.

    At DataWeave, we provide Competitive Intelligence as a Service to retailers and consumer brands by aggregating Web data at scale and distilling them to produce actionable competitive insights. To this end, our proprietary data aggregation and analysis platform captures and compiles over a hundred million data points from the Web each day. Sure enough, our platform relies on APIs to deliver data and insights to our customers, as well as for communication between internal subsystems.

    Some Python REST API frameworks we use are:

    • Tornado — which supports asynchronous requests
    • CherryPy — which is multi-threaded
    • Flask-Gunicorn — which enables easy worker management

    It is essential to evaluate API frameworks depending on the demands of your tech platforms and your objectives. At DataWeave, we assess them based on their speed and their ability to support high concurrency. So far, we’ve been using CherryPy, a widely used framework, which has served us well.

    CherryPy

    An easy-to-use API framework, CherryPy does not require complex customizations, runs out of the box, and supports concurrency. At DataWeave, we rely on CherryPy to access configurations, serve data to and from different datastores, and deliver customized insights to our customers. So far, this framework has displayed very impressive performance.

    However, a couple of months ago, we were in the process of migrating to Python 3 (from Python 2), opening doors to a new API framework written exclusively for Python 3 — Sanic.

    Sanic

    Sanic is built on uvloop, an event loop based on libuv (the same event-handling library Node.js uses), and hence is a good contender for being fast.

    (libuv is an asynchronous event-handling library, and one of the reasons for its agility is its ability to handle asynchronous events through callbacks. More info on libuv can be found here.)

    In fact, Sanic is reported to be one of the fastest API frameworks in the world today, and uses the same event-handling framework as Node.js, which is known to serve fast APIs. More information on Sanic can be found here.

    So we asked ourselves, should we move from CherryPy to Sanic?

    Before jumping on the hype bandwagon, we looked to first benchmark Sanic with CherryPy.

    CherryPy vs Sanic

    Objective

    Benchmark CherryPy and Sanic to process 500 concurrent requests, at a rate of 3500 requests per second.

    Test Setup

    • Machine configuration: 4 vCPUs / 8 GB RAM
    • Cloud: GCE (Google Compute Engine)
    • Number of CherryPy/Sanic APIs: 3 (inserting data into 3 topics of a Kafka cluster)
    • Testing tool: Apache Benchmark (ab)
    • Payload size: all requests are POST requests with a 2.1 KB payload

    API Details

    • Sanic: in async mode
    • CherryPy: 10 concurrent threads in each API — a total of 30 concurrent threads
    • Concurrency: tested the APIs at various concurrency levels, varying between 10 and 500
    • Number of requests: 1,00,000
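
    For context, here is a stripped-down sketch of the two styles of endpoint we compared; the route names and handler bodies are illustrative, and the Kafka producer calls are omitted.

    # Sanic: a single async endpoint (run with app.run(host="0.0.0.0", port=8000))
    from sanic import Sanic
    from sanic.response import json as json_response

    app = Sanic("ingest_api")

    @app.route("/ingest", methods=["POST"])
    async def ingest(request):
        payload = request.json  # the 2.1 KB POST body; in the benchmark this was pushed to a Kafka topic (omitted here)
        return json_response({"status": "ok"})

    # CherryPy: the equivalent handler, served by a pool of 10 threads
    import cherrypy

    class IngestApi:
        @cherrypy.expose
        @cherrypy.tools.json_in()
        @cherrypy.tools.json_out()
        def ingest(self):
            payload = cherrypy.request.json  # same payload, handled by a worker thread
            return {"status": "ok"}

    # cherrypy.config.update({"server.thread_pool": 10}); cherrypy.quickstart(IngestApi())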

    Results

    Requests Completion: A lower mean and a lower spread indicate better performance

     

    Observation

    When the concurrency is as low as 10, there is not much difference between the performance of the two API frameworks. However, as the concurrency increases, Sanic’s performance becomes more predictable, and the API framework functions with lower response times.

    Requests / Second: Higher values indicate faster performance

    Sanic clearly achieves higher requests/second because:

    • Sanic is running in Async mode
    • The mean response time for Sanic is much lower, compared to CherryPy

    Failures: Lower values indicate better reliability

    The number of non-2xx responses increased for CherryPy as concurrency increased. In contrast, the number of failed requests in Sanic stayed below 10, even at high concurrency values.

    Conclusion

    Sanic clearly outperformed CherryPy, and was much faster, while supporting higher concurrency and requests per second, and displaying significantly lower failure rates.

    Following these results, we transitioned to Sanic for ingesting high-volume data into our datastores, and started seeing much faster and more reliable performance. We now aggregate much larger volumes of data from the Web, at faster rates.

    Of course, as mentioned earlier in the article, it is important to evaluate your API framework based on the nuances of your setup and its relevant objectives. In our setup, Sanic definitely seems to perform better than CherryPy.

    What do you think? Let me know your thoughts in the comments section below.

    If you’re curious to know more about DataWeave’s technology platform, check out our website, and if you wish to join our team, check out our jobs page!

     

  • Alibaba’s Singles Day Sale: Decoding the World’s Biggest Shopping Festival

    Alibaba’s Singles Day Sale: Decoding the World’s Biggest Shopping Festival

    $17.5 million every 60 seconds.

    That’s the volume of sales Alibaba generated on 11.11, or Singles Day. This mammoth event, decisively the world’s biggest shopping day, dwarfed last year’s Black Friday and Cyber Monday combined.

    This year, the anticipation around Singles Day was all-pervasive, and the sale was widely expected to break all records, as more than 60,000 global brands queued up to participate. By the end of the day, sales topped $25.3 billion, while shattering last year’s record by lunchtime.

    It’s an astonishing feat of retailing, eight years in the making. When Alibaba first started 11.11 in 2009, they set out strategically to try and convert shopping into a sport, infusing it with a strong element of entertainment. “Retail as entertainment” is a unique central theme for 11.11 and this year Alibaba leveraged its media and eCommerce platforms in concert to create an entirely immersive experience for viewers and consumers alike.

    From a technology perspective, the “See Now, Buy Now” fashion show and the pre-sale gala seamlessly merged offline and online shopping so viewers tuning in to both shows can watch them while simultaneously shopping via their phones or saving the items for a later date.

    The eCommerce giant also collaborated with roughly 50 shopping malls in China to set up pop-up shops, eventually extending its shopper reach to span 12 cities.

    Of course, attractive discounts on its eCommerce platforms were on offer as well.

    Deciphering Taobao.com

    At DataWeave, we have been analyzing the major sale events of several eCommerce companies from around the world. During Singles Day, when we trained our data aggregation and analysis platform on Taobao.com (Alibaba’s B2C eCommerce arm), and its competitors JD.com and Amazon.ch, our technology platform and analysts had to overcome two primary challenges:

    1. All text on these websites were in Chinese

    All information — names of products, brands, and categories — was displayed in Chinese. However, our technology platform is truly language agnostic, capable of processing data drawn from websites featuring all international languages. Several of our customers have benefited strategically from this unique capability.

    2.  Discounted prices were embedded in images on Taobao.com

    While it’s normal for sale prices to be represented in text on a website (relatively easy to capture by our advanced data aggregation system), Taobao chose to display these prices as part of its product images — like the one shown in the adjacent image.

    However, our technology stack comprises an AI-powered, state-of-the-art image processing and analytics platform, which quickly extracted the selling prices embedded in the images with very high accuracy.

    We analyzed the Top 150 ranked products of over 20 product types, spread across Electronics, Men’s Fashion, and Women’s Fashion, representing over 25,000 products in total, each day, between 8.11 and 12.11.

    In the following infographic, we analyze the absolute discounts offered by Taobao on 11.11, compared to 8.11 (based on pricing information extracted from the product images using our image analytics platform), together with an insight into the level of premium products included in their mix for each product type, between the two days of comparison.

    Unexpectedly, we noticed that each day, ALL the products in the Top 150 ranks differed from the previous day — a striking insight into Taobao’s unique assortment strategy.

    Counter-intuitively, absolute discounts across all categories were considerably higher on 8.11 than on 11.11, even if for marginally fewer products. The number of discounted Electronics products on sale rose on 11.11 compared to 8.11 (124 versus 102 respectively), while there was little movement in the number of discounted Men’s Fashion (55 versus 57) and Women’s Fashion (35 versus 27) products.

    Taobao targeted the mobile phone and tablets segment with aggressive discounts (21.0 percent and 18.2 percent respectively), compared to the average Electronics discount level of 7.7 percent.

    Interestingly, the average selling price drifted up for Electronics on 11.11 compared to 8.11 (¥4040 versus ¥3330). Men’s Fashion dropped to ¥584 from ¥604, while prices for Women’s Fashion were stable.

    It’s clear that even with all the fanfare, Singles Day didn’t produce the level of discounts that one might have expected, indicating that purchases were driven as much by the hype surrounding the event as anything else.

    How did Alibaba’s Competitors Fare?

    While Taobao was widely expected to offer discounts during Alibaba’s major sale event, we looked at how its competitors JD.com and Amazon.ch reacted to Taobao’s strategy.

    As over 80 percent of top-ranked products were consistently present in the Top 150 ranks of each product type on these websites, we analyzed the additional discounts offered during 11.11, compared to prices on 8.11.

    Broadly speaking, both Amazon.ch and JD.com appear to have elected not to go head to head with Taobao on specific segments. JD.com’s discount strategy was spearheaded by Sports Shoes (22.1 percent) and Refrigerators (14.8 percent) while Amazon.ch featured TVs (15.3 percent) and Mobile Phones (10.2 percent).

    The average additional discount offered by Amazon.ch and JD.com in Electronics (8.4 percent) was slightly above Taobao’s overall absolute discount (7.7 percent). TCL was aggressive with its pricing on both websites, offering over 20% discount on almost its entire assortment.

    Surprisingly, JD.com swamped Amazon.ch’s number of additionally discounted products across all three featured categories, although this may be partially explained by Amazon.ch electing to adopt a significantly more premium price position in both Men’s and Women’s Fashion compared to JD.com, while remaining roughly line ball on Electronics.

    Jack Ma’s “New Retail”

    Interestingly, JD.com wasn’t far behind Taobao in terms of sales, clocking up $20 billion in revenue, and sparking an interesting public debate between the two eCommerce giants extolling their respective performances.

    Singles Day is one of the pillars of Jack Ma’s vision of a “New Retail” represented by the merging of entertainment and consumption. Ma’s vision sees the boundary between offline and online commerce disappearing as the focus shifts dramatically to fulfilling the personalized needs of individual customers.

    Hence, Alibaba’s Global Shopping Festival should be understood as not just a one-day event that produces massive revenue, but as a demonstrable tour de force of Alibaba’s vision for the future of retail. One thing is certain — as competition heats up between Chinese retailers, we can be prepared for another Singles Day shoot-out sale next year that one-ups the staggering sales volumes this year.

    If you’re intrigued by DataWeave’s technology, check out our website to learn more about how we provide Competitive Intelligence as a Service to retailers and consumer brands globally.

     

  • Video: Using Product Images to Achieve Over 90% Accuracy in Matching E-Commerce Products

    Video: Using Product Images to Achieve Over 90% Accuracy in Matching E-Commerce Products

    Matching images is hard!

    Images, intrinsically, are complex forms of information, with varying backgrounds, orientations, and noise. Developing a reliable system that achieves human-like accuracy in identifying, interpreting, and comparing images, without investing in expensive resources, is no mean task.

    For DataWeave, however, the ability to accurately match images is fundamental to the value we provide to retailers and consumer brands.

    Why Match Images?

    Our customers rely on us for timely and actionable insights on their competitors’ pricing, assortment, promotions, etc. compared to their own. To enable this, we need to identify and match products across multiple websites, at very large scale.

    One might hope to easily match products using just the product titles and descriptions on websites. However, therein lies the rub. Text-based fields are typically unstructured, and lack consistency or standardization across websites (especially for fashion products). In the following example, the same Adidas jacket is listed as “Tiro Warm-Up Jacket, Big Boys (8–20)” on Macy’s and “Youth Soccer Tiro 15 Training Jacket” on Amazon.

    Hence, instead of using text-based information, we considered using deep-learning techniques to match the images of products listed on e-commerce websites. This, though, requires massive GPU resources and training data fed into the deep-learning model — an expensive proposition.

    The solution we arrived at was to complement our image-matching system with the text-based information available in product titles and descriptions. Analyzing this combination of both text- and image-based information enabled us to efficiently match products at greater than 90% accuracy.
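
    As a toy illustration of the idea, one could blend a text-similarity score with an image-similarity score when ranking candidate matches; the weights, score ranges, and field names below are assumptions for clarity, not our production values.

    import numpy as np

    def cosine(a, b):
        # Cosine similarity between two image feature vectors
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def match_score(seed, candidate, text_weight=0.4, image_weight=0.6):
        # candidate["title_score"]: text similarity (e.g. from a search engine), scaled to [0, 1]
        # "embedding": image feature vector (e.g. from a deep-learning model)
        text_sim = candidate["title_score"]
        image_sim = cosine(seed["embedding"], candidate["embedding"])
        return text_weight * text_sim + image_weight * image_sim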

    How We Did It

    A couple of weeks ago, I gave a talk at Fifth Elephant, one of India’s renowned data science conferences. In the talk, I demonstrated DataWeave’s innovation of augmenting the NLP capabilities of Solr (a popular text search engine) with deep-learning features to match images with high accuracy.

    Check out the video of the presentation for a detailed account of the system we built:

    Human-Aided Machine Intelligence

    All products matched with the seed product are tagged with a corresponding confidence score. When this score crosses a certain threshold, it’s presumed to be a direct match. The ones that are part of a lower range of confidence scores are quickly examined manually for possible direct matches.

    The outcome, therefore, is that our technology narrows down the consideration set of possible product matches from a theoretical upper limit of millions of products, to only a few tens of products, which are then manually checked. This unique approach has two distinct advantages:

    • The human-in-the-loop enables us to achieve greater than 90% accuracy in matching millions of products — a key differentiator.
    • Information on all manually matched products is continually fed to the deep-learning model, which is used as training data, further enhancing the accuracy of the product matching mechanism. As a result, both our accuracy and delivery time keep improving with time.

    As the world of online commerce continues to evolve and becomes more competitive, retailers and consumer brands need the ability to make quick proactive and reactive decisions, if they are to stay competitive. By building an automated self-improving system that matches products quickly and accurately, DataWeave enables just that.

    Find out more about how retailers and consumer brands use DataWeave to better understand their competitive environment, optimize customer experience, and drive profitable growth.

  • Implement a Machine-Learning Product Classification System

    Implement a Machine-Learning Product Classification System

    For online retailers, price competitiveness and a broad assortment of products are key to acquiring new customers and driving customer retention. To achieve these, they need timely, in-depth information on the pricing and product assortment of competing retailers. However, in the dynamic world of online retail, price changes occur frequently, and products are constantly added, removed, and running out of stock, all of which impedes easy access to competitive information.

    At DataWeave, we address this challenge by providing retailers with competitive pricing and assortment intelligence, i.e. information on their pricing and assortment, in comparison to their competition’s.

    The Need for Product Classification

    On acquiring online product and pricing data across websites using our proprietary data acquisition platform, we are tasked with representing this information in an easily consumable form. For example, retailers need product and pricing information along multiple dimensions, such as — the product categories, types, etc. in which they are the most and least price competitive, or the strengths and weaknesses of their assortment for each category, product type, etc.

    Therefore, there is a need to classify the products in our database in an automated manner. However, this process can be quite complex, since in online retail, every website has its own hierarchy of classifying products. For example, while “Electronics” may be a top-level category on one website, another may have “Home Electronics”, “Computers and Accessories”, etc. as top-level categories. Some websites may even have overlaps between categories, such as “Kitchen and Furniture” and “Kitchen and Home Appliances”.

    Addressing this lack of standardization in online retail categories is one of the fundamental building blocks of delivering information that is easily consumable and actionable.

    We, therefore, built a machine-learning product classification system that can predict a normalized category name for a product, given an unstructured textual representation. For example:

    • Input: “Men’s Wool Blend Sweater Charcoal Twist and Navy and Cream Small”
    • Output: “Clothing”
    • Input: “Nisi 58 mm Ultra Violet UV Filter”
    • Output: “Cameras and Accessories”

    To classify categories, we first created a set of categories that was inclusive of variations in product titles found across different websites. Then, we moved on to building a classifier based on supervised learning.

    What is Supervised Learning?

    Supervised learning is a type of machine learning in which we “train” a product classification system by providing it with labelled data. To classify products, we can use product information, along with the associated category as label, to train a machine learning model. This model “learns” how to classify new, but similar products into the categories we train it with.

    To understand how product information can be used to train the model, we identified what data points about products we can use, and the challenges associated with using it.

    For example, this is what a product’s record looks like in our database:

    {
      "title": "Apple MacBook Pro Retina Display 13.3\" 128 GB SSD 8 GB RAM",
      "website": "Amazon",
      "meta": "Electronics > Computer and Accessories > Laptops > Macbooks",
      "price": "83000"
    }

    Here, “title” is unstructured text for a product. The hierarchical classification of the product on the given website is shown by “meta”.

    This product’s “title” can be represented in a structured format as:

    {
      "Brand": "Apple",
      "Screen Size": "13.3 inches",
      "Screen Type": "Retina Display",
      "RAM": "8 GB",
      "Storage": "128 GB SSD"
    }

    In this structured object, “Brand”, “Screen Size”, “Screen Type” and so on are referred to as “attributes”. Their associated items are referred to as “values”.

    Challenges of Working with Text

    Lack of uniformity in product titles across websites –

    In the example shown above, the given structured object is only one way of structuring the given unstructured text (title). The product title would likely change for every website it’s represented on. What’s worse, some websites lack any form of structured representation. Also, attributes and values may have different representations on different websites — ‘RAM’ may be referred to as ‘Memory’.

    Absence of complete product information –

    Not all websites provide complete product information in the title. Even when structured information is provided, the level of detail may vary across websites.

    Since these challenges are substantial, we chose to use unstructured titles of products as training inputs for supervised learning.

    Pre-processing and Vectorisation of Training Data

    Pre-processing of titles can be done as follows:

    • Lowercasing
    • Removing special characters
    • Removing stop words (like ‘and’, ‘by’, ‘for’, etc.)
    • Generating unigram and bigram tokens

    We then represented each title as a vector using the Bag of Words model, with unigram and bigram tokens.
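
    A minimal sketch of these pre-processing steps on a product title; the stop-word list below is an illustrative subset, not the full list we use.

    import re

    STOP_WORDS = {"and", "by", "for", "with", "the"}  # illustrative subset

    def preprocess(title):
        # Lowercase and remove special characters
        text = re.sub(r"[^a-z0-9 ]", " ", title.lower())
        # Remove stop words
        tokens = [t for t in text.split() if t not in STOP_WORDS]
        # Generate bigram tokens alongside the unigrams
        bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
        return tokens + bigrams

    print(preprocess("Men's Wool Blend Sweater Charcoal Twist and Navy and Cream Small"))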

    The Algorithm

    We used Support Vector Machine (SVM) and compared the results with Naive Bayes Classifiers, Decision Trees and Random Forest.
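
    A sketch of how such a comparison could look with scikit-learn, assuming the titles and their normalized category labels are available; the two-example training set here is just a placeholder.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    # Placeholder labelled data: unstructured titles and normalized categories
    titles = ["Nisi 58 mm Ultra Violet UV Filter", "Men's Wool Blend Sweater Charcoal"]
    labels = ["Cameras and Accessories", "Clothing"]

    for name, clf in [("SVM", LinearSVC()), ("Naive Bayes", MultinomialNB())]:
        model = Pipeline([
            ("bow", CountVectorizer(ngram_range=(1, 2), stop_words="english")),  # unigrams + bigrams
            ("clf", clf),
        ])
        model.fit(titles, labels)
        print(name, model.predict(["Nisi 52 mm UV Filter"]))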

    Training Data Generation

    The total volume of product data we’ve acquired runs into the hundreds of millions of records, and every category has a different number of products. For example, we may have 40 million products in the “Clothing” category but only 2 million products in the “Sports and Fitness” category. We used a stratified sampling technique to ensure that we got a subset of the data that captures the maximum variation in the entire dataset.

    For each category, we included data from most websites that contained products of that category. Within each website, we included data from all subcategories and product types. The size of the data-set we used is about 10 million, sourced from 40 websites. We then divided our labelled data-set into two parts: training data-set and testing data-set.
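
    As an aside, a stratified split that preserves category proportions can be sketched with scikit-learn; the placeholder lists below stand in for the sampled titles and labels.

    from sklearn.model_selection import train_test_split

    # Placeholder sampled data
    titles = ["UV Filter 52 mm", "UV Filter 58 mm", "Wool Sweater", "Cotton Sweater"]
    labels = ["Cameras and Accessories", "Cameras and Accessories", "Clothing", "Clothing"]

    # Stratify so each category keeps the same proportion in the training and testing sets
    train_titles, test_titles, train_labels, test_labels = train_test_split(
        titles, labels, test_size=0.5, stratify=labels, random_state=42)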

    Evaluating the Model

    After training with the training dataset, we tested this machine-learning classification system using the testing dataset to find the accuracy of the model.

    Clearly, SVM generated the best accuracy compared to the other classifiers.

    Performance Statistics

    • System Specifications: 8-Core system (Intel(R) Xeon(R) CPU E3–1231 v3 @ 3.40GHz) with 32 GB RAM
    • Training Time: 90 minutes (approximately)
    • Prediction Time: Approximately 6 minutes to classify 1 million product titles. This is equivalent to about 3000 titles per second.

    Example Inputs and Outputs from the SVM Model (with Decision Values)

    • Input: “Washing Machine Top Load”

    Output: {“Home Appliances”: 1.45, “Home and Living”: 0.60, “Tools and Hardware”: 0.54}

    • Input: “Nisi 58 mm Ultra Violet UV Filter”

    Output: {“Cameras and Accessories”: 1.46, “Eyewear”: 1.14, “Home and Living”: 1.12}

    • Input: “NETGEAR AirCard AC778AT Around Town Mobile Internet — Mobile hot”

    Output: {“Computers and Accessories”: 0.82, “Books”: 0.61, “Toys”: 0.27}

    • Input: “Nike Sports Tee”

    Output: {“Sports and Fitness”: 1.63, “Footwear”: 0.63, “Toys and Baby Products”: 0.59}

    Most of the outputs were accurate, which is no mean feat. Some incorrect outputs were those of fairly similar categories. For example, “Home and Living” was predicted for products that should have ideally been part of “Home Appliances”. Other incorrect predictions occurred when the input was ambiguous.

    There were also scenarios where the output decision values of the top two categories were quite close (as shown in the third example above), especially when the input was vague. In the last example above, the product should have been classified as “Clothing”, but got classified as “Sports and Fitness” instead, which is not entirely incorrect.

    Delivering Value with Competitive Intelligence

    The category classifier elucidated in this article is only the first element of a universal product organization system that we’ve built at DataWeave. The output of our category classification system is used by other in-house machine-learning and heuristic-based systems to generate more detailed product categories, types, subcategories, attributes, and the like.

    Our universal product organization system is the backbone of the Competitive Pricing and Assortment Intelligence solutions we provide to online retailers, which enable them to evaluate their pricing and assortment against competitors along multiple dimensions, helping them compete effectively in the cutthroat eCommerce space.

    Click here to find out more about DataWeave’s solutions and how modern retailers harness the power of data to drive revenue and margins.

  • Baahubali 2: Dissecting 75,000 Tweets to Uncover Audience Sentiments

    Baahubali 2: Dissecting 75,000 Tweets to Uncover Audience Sentiments

    Why did Katappa kill Baahubali?

    Two years ago, not many would have foreseen this sentence capturing the imagination of the country like it has. Demolishing all regional barriers, the movie has grossed over INR 500 crores across the world in only its first three days.

    While the first movie received lavish praise for its ambition, technical values, and story, the sequel, bogged down by bloated expectations, has polarized the critics’ fraternity. Some critics compare the movie’s computer graphics favorably to Hollywood productions like Lord of the Rings. Others find the movie lacking in pacing and plot.

    The masses, however, have reportedly lapped the movie up. Social media channels are brimming with opinions, and if one is to attempt finding out the aggregate views of audiences, Twitter is a good place to start.

    At DataWeave, we ran our proprietary, AI-powered ‘Sentiment Analysis’ algorithm over all tweets about Baahubali 2 in the first three days of its release, and observed some interesting insights.

    Twitterati Reactions to Baahubali 2

    Overall, the Twitterati’s views on the movie were overwhelmingly positive. We analysed over 75,000 tweets and identified the sentiments expressed on several facets of the movie, such as Visuals, Acting, Prabhas, etc. The following graphic indicates how the movie fared in some of these categories.

    The Baahubali team, Anushka (actor), Rajamouli (director), and Prabhas (actor), are all perceived as huge positive influences on the movie. Rajamouli, specifically, met with almost universal approval for his dedication and execution. Several viewers cheered the movie on as a triumph of Indian cinema, one which has redefined the cinema landscape of the country. There was considerable praise for the story, Rana (actor), and acting performances, as well.

    The not-so-positive sentiments were reserved for the reason behind Katappa killing Baahubali (no spoilers!), the visuals, and the second half of the movie. Many viewers found the second half to be slow, with unrealistic visuals and action sequences. For example, one of the tweets read:

    “First half was good, but the second half is beyond Rajnikanth movies: humans uprooting trees!”

    While these insights seem simple enough to understand, the technology to filter inevitably chaotic online content and extract meaningful information is incredibly complex.

    Unearthing Meaning from Chaos

    At DataWeave, we provide enterprises with Competitive Intelligence as a Service by aggregating and analyzing millions of unstructured data points on the web, across multiple sources. This enables businesses to better understand their competitive environment and make data-driven decisions to grow their business.

    One of our solutions — Sentiment Analysis — helps brands study customer preferences at a product attribute level by analyzing customer reviews. We used the same technology to analyze the reaction of audiences globally to Baahubali 2. After data acquisition, this process consists of three steps –

    Step 1: Features Extraction

    To identify the “features” that reviewers are talking about, we first understand the syntactical structure of the tweets and separate words into nouns, verbs, adjectives, etc. This needs to account for complexities like synonyms, spelling errors, paraphrases, noise, etc. Our AI-based technology platform then uses various advanced techniques to generate a list of “uni-features” and “compound features” (more than one word for a feature).
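
    A bare-bones illustration of this first pass using NLTK’s part-of-speech tagger; the real pipeline handles synonyms, misspellings, and compound features on top of this.

    import nltk  # requires the "punkt" and "averaged_perceptron_tagger" data packages

    tweet = "I saw the movie visuals awesome bad climax felt director unnecessarily dragged the second half"

    # Tokenize and tag each word with its part of speech
    tagged = nltk.pos_tag(nltk.word_tokenize(tweet))

    # Nouns become candidate features, adjectives candidate opinion words
    features = [word for word, tag in tagged if tag.startswith("NN")]
    opinions = [word for word, tag in tagged if tag.startswith("JJ")]
    print(features, opinions)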

    Step 2: Identifying Feature-Opinion Pairs

    Next, we identify the relationship between the feature and the opinion. One of the reasons this is challenging with Twitter is that, most of the time, Twitter users treat grammar with utter disdain. Case in point:

    “I saw the movie visuals awesome bad climax felt director unnecessarily dragged the second half”

    In this case, the feature-opinion pairs are visuals: awesome, climax: bad, second half: unnecessarily dragged. Clearly, something as simple as attributing the nearest opinion-word to the feature is not good enough. Here again, we use advanced AI-based techniques to accurately classify feature-opinion pairs.

    We classified close to 1000 opinion words and matched them to each feature. The infographic below shows groups of similar words that the AI algorithm clustered into a single feature, and the top positive and negative sentiments expressed by the Twitterati for each feature.

    While our technology can associate words with similar meaning, such as ‘part after interval’ and ‘second half’, it can also handle spelling errors by grouping ‘Rajamouli’ and ‘Raajamouli’ as a single feature.

    Adjectives like ‘magnificent’ and ‘creative’ were used to describe the Baahubali team positively, while words like ‘boring’, ‘disappointed’, and ‘tiring’ were used to describe the second half of the movie negatively.

    Step 3: Sentiment Calculation

    Lastly, we calculate the sentiment score, which is determined by the strength of the opinion word, the number of retweets, and the time of the tweet. The weighted average is normalized to generate a score on a scale of 0% to 100%.
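
    One illustrative way such a normalized score could be computed; the weighting scheme below is an assumption for clarity, not our exact formula.

    def sentiment_score(opinions):
        # opinions: list of (polarity in [-1, 1], opinion strength, retweets, recency weight)
        weights = [strength * (1 + retweets) * recency for _, strength, retweets, recency in opinions]
        weighted_sum = sum(p * w for (p, _, _, _), w in zip(opinions, weights))
        total = sum(weights) or 1.0
        # Map the weighted average from [-1, 1] to a 0%-100% scale
        return round(50 * (1 + weighted_sum / total), 1)

    print(sentiment_score([(0.9, 0.8, 120, 1.0), (-0.6, 0.7, 10, 0.8)]))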

    A Peephole into the Consumer’s Mind

    As more and more people express their views and opinions in the online world, there is more of an opportunity to use these data points to drive business strategies.

    Consumer-focused brands use DataWeave’s Sentiment Analysis solution as a key element of their product strategy, by reinforcing attributes with positive sentiments in reviews, and improving or eliminating attributes with negative sentiments in reviews.

    Click here to find out more about the benefits of using DataWeave’s Sentiment Analysis!

     

  • A Peek into GNU Parallel

    A Peek into GNU Parallel

    GNU Parallel is a tool that can be deployed from a shell to parallelize job execution. A job can be anything from a simple shell script to complex interdependent Python/Ruby/Perl scripts. The simplicity of the ‘Parallel’ tool lies in its usage. A modern-day computer with a multicore processor is enough to run your jobs in parallel. A single-core computer can also run the tool, but the user won’t see any difference, as the jobs will be context-switched by the underlying OS.

    At DataWeave, we use Parallel for automating and parallelizing a number of resource-intensive processes, ranging from crawling to data extraction. All our servers have 8 cores, each capable of executing 4 threads. So, we experienced a huge performance gain after deploying Parallel. Our in-house image processing algorithms used to take more than a day to process 200,000 high-resolution images. After using Parallel, we have brought that time down to a little over 40 minutes!

    GNU Parallel can be installed on any Linux box and does not require sudo access. The following command will install the tool:

    (wget -O - pi.dk/3 || curl pi.dk/3/) | bash

    GNU Parallel can read inputs from a number of sources — a file or command line or stdin. The following simple example takes the input from the command line and executes in parallel:

    parallel echo ::: A B C

    The following takes the input from a file:

    parallel -a somefile.txt echo

    Or from STDIN:

    cat somefile.txt | parallel echo

    The inputs can be from multiple files too:

    parallel -a somefile.txt -a anotherfile.txt echo

    The number of simultaneous jobs can be controlled using the --jobs or -j switch. The following command will run 5 jobs at once:

    parallel --jobs 5 echo ::: A B C D E F G H I J

    By default, the number of jobs will be equal to the number of CPU cores. However, this can be overridden using percentages. The following will run 2 jobs per CPU core:

    parallel --jobs 200% echo ::: A B C D

    If you do not want to set any limit, then the following will use all the available CPU cores in the machine. However, this is NOT recommended in a production environment, as other jobs running on the machine will be vastly slowed down.

    parallel --jobs 0 echo ::: A B C

    Enough with the toy examples. The following will show you how to bulk insert JSON documents in parallel into a MongoDB cluster. We almost always need to insert millions of documents quickly into our MongoDB cluster, and inserting them serially doesn’t cut it. Moreover, MongoDB can handle parallel inserts.

    The following is a snippet of a file of JSON documents. Let’s assume that there are a million similar records in the file, with one JSON document per line.

    {"name": "John", "city": "Boston", "age": 23}
    {"name": "Alice", "city": "Seattle", "age": 31}
    {"name": "Patrick", "city": "LA", "age": 27}
    ...

    The following Python script (insertDB.py) takes a single JSON document as a command-line argument and inserts it into the "people" collection under the "dw" database.

    import json
    import sys

    import pymongo

    # Parse the JSON document passed in as a command-line argument by GNU Parallel
    document = json.loads(sys.argv[1])

    client = pymongo.MongoClient()
    db = client["dw"]
    collection = db["people"]

    try:
        collection.insert_one(document)
    except Exception as e:
        print("Could not insert document in db", repr(e))

    Now, to run this in parallel, the following command should do the magic:

    cat people.json | parallel 'python insertDB.py {}'

    That’s it! There are many switches and options available for advanced processing. They can be accessed by running man parallel in the shell. Also, the following page has a set of tutorials: GNU Parallel Tutorials.

  • How to Conquer Data Mountains API by API | DataWeave

    How to Conquer Data Mountains API by API | DataWeave

    Let’s revisit our raison d’être: DataWeave is a platform on which we do large-scale data aggregation and serve this data in forms that are easily consumable. The nature of the data that we deal with is that: (1) it is publicly available on the web, (2) it is factual (to the extent possible), and (3) it has high utility (decided by a number of factors that we discuss below).

    The primary access channels for our data are the Data APIs. Other access channels such as visualizations, reports, dashboards, and alerting systems are built on top of our data APIs. Data Products such as PriceWeave are built by combining multiple APIs and packaging them with reporting and analytics modules.

    Even as the platform is capable of aggregating any kind of data on the web, we need to prioritize the data that we aggregate, and the data products that we build. There are a lot of factors that help us in deciding what kinds of data we must aggregate and the APIs we must provide on DataWeave. Some of these factors are:

    1. Business Case: A strong business use-case for the API. There has to be an inherent pain point the data set must solve. Be it the Telecom Tariffs API or the Price Intelligence API — there are a bunch of pain points they solve for distinct customer segments.
    2. Scale of Impact: There has to exist a large enough volume of potential consumers that are going through the pain points, that this data API would solve. Consider the volume of the target consumers for the Commodity Prices API, for instance.
    3. Sustained Data Need: Data that a consumer needs frequently and/or on a long term basis has greater utility than data that is needed infrequently. We look at weather and prices all the time. Census figures, not so much.
    4. Assured Data Quality: Our consumers need to be able to trust the data we serve: data has to be complete as well as correct. Therefore, we need to ensure that there exist reliable public sources on the Web that contain the data points required to create the API.

    Once these factors are accounted for, the process of creating the APIs begins. One question that we are often asked is the following: how easy/difficult is it to create data APIs? That again depends on many factors. There are many dimensions to the data we are dealing with that help us decide the level of difficulty. Below we briefly touch upon some of those:

    1. Structure: Textual data on the Web can be structured/semi-structured/unstructured. Extracting relevant data points from semi-structured and unstructured data without the existence of a data model can be extremely tricky. The process of recognizing the underlying pattern, automating the data extraction process, and monitoring accuracy of extracted data becomes difficult when dealing with unstructured data at scale.

    2. Temporality: Data can be static or temporal in nature. Aggregating static data sets is a one-time effort. Scenarios where data changes frequently or new data points are constantly generated pose challenges related to scalability and data consistency. For example, the India Local Mandi Prices API gets updated on a day-to-day basis with new data being added. When aggregating temporal data, monitoring changes to data sources and data accuracy becomes a challenge; one needs systems in place that ensure data is aggregated frequently and also monitored for accuracy.

    3. Completeness: At one end of the spectrum, we have existing data sets that are publicly downloadable. At the other end, we have data points spread across sources. To create data sets over these data points, they need to be aggregated and curated before they can be used. These sources publish data in their own formats and update them at different intervals. As always, “the whole is larger than the sum of its parts”: these individual data points, when aggregated and presented together, have many more use cases than the individual data points themselves.

    4. Representations: Data on the Web exists in various formats including (if we are particularly unlucky!) non-standard/proprietary ones. Data exists in HTML, XML, XLS, PDFs, docs, and many more. Extracting data from these different formats and presenting them through standard representations comes with its own challenges.

    5. Complexity: Data sets wherein data points are independent of each other are fairly simple to reason about. On the other hand, consider network data sets such as social data, maps, and transportation networks. The complexity arises from the relationships that can exist between data points within and across data sets. The extent of pre-processing required to analyse these relationships is huge even at a small scale, which makes these data sets hard to work with.

    6. (Pre/Post) Processing: There is a lot of pre-processing involved in making raw crawled data presentable and accessible through a data API. This involves cleaning, normalization, and representing data in standard forms. Once we have the data API, there are a number of ways this data can be processed to create new and interesting APIs.

    So that, at a high level, is the way we work at DataWeave. Our vision is that of curating and providing access to all of the world’s public data. We are progressing towards this vision one API at a time.

    Originally published at blog.dataweave.in.

  • API of Telecom Recharge Plans in India

    API of Telecom Recharge Plans in India

    Several months ago we released our Telecom recharge plans API. It soon turned out to be one of our more popular APIs, with some of the leading online recharge portals using it extensively. (So, the next time you recharge your phone, remember us :))

    In this post, we’ll talk in detail about the genesis of this API and the problem it is solving.

    Before that, and since we are in the business of building data products, some data points.

    As you can see, most mobile phones in India are prepaid. That is to say, there is a huge prepaid mobile recharge market. Just how big is this market?

    The above infographic is based on a recent report by Avendus [pdf]. Let’s focus on the online prepaid recharge market. Some facts:

    1. There are around 11 companies that provide an online prepaid recharge service. Here’s the list: mobikwik, rechargeitnow, paytm, freecharge, justrechargeit, easymobilerecharge, indiamobilerecharge, rechargeguru, onestoprecharge, ezrecharge, anytimerecharge
    2. RechargeItNow seems to be the biggest player. As of August 2013, they claimed annual transactions worth INR 6 billion, with over 100,000 recharges per day pan-India.
    3. PayTM, Freecharge, and Mobikwik seem to be the other big players. Freecharge claimed recharge volumes of 40,000/day in June 2012 (~ INR 2 billion worth of transactions), and they have been growing steadily.
    4. Telcos offer a commission of approximately 3% to third-party recharge portals. This means there is an opportunity worth about 4 billion as of today.
    5. Despite the Internet penetration in India being around 11%, only about 1% of mobile prepaid recharges happen online. This goes to show the huge opportunity that lies untapped!
    6. It also goes to show why there are so many players entering this space. It’s only going to get more crowded.

    What does all this have to do with DataWeave? Let’s talk about the scale of the “data problem” that we are dealing with here. Some numbers that give an estimate on this.

    There are 13 cellular service providers in India. Here’s the list: Aircel Cellular Ltd, Aircel Limited, Bharti Airtel, BSNL, Dishnet Wireless, IDEA (operates as Idea ABTL & Spice in different states), Loop Mobile, MTNL, Reliable Internet, Reliance Telecom, Uninor, Videocon, and Vodafone. There are 22 circles in India. (Not every service provider has operations in every circle.)

    Find below the number of telecom recharge plans we have in our database for various operators.

    In fact, you can see that between last week and today, we have added about 300 new plans (including plans for a new operator).

    The number of plans varies across operators. Vodafone, for instance, gives its users a huge number of options.

    The plans vary based on factors such as: denomination, recharge value, recharge talktime, recharge validity, plan type (voice/data), and of course, circle as well as the operator.

    For a third-party recharge service provider, the following are daily pain points:

    • plans become invalid on a regular basis
    • new plans are added on a regular basis
    • the features associated with a plan change (e.g., a ‘xx mins free talk time’ plan becomes ‘unlimited validity’ or something else)

    We see that tens of plans become invalid (and new ones get introduced) every day. All third-party recharge portals lose a significant amount of money on a daily basis because they might not have information about all the plans, or they might be displaying invalid plans.

    DataWeave’s Telecom Recharge Plans API solves this problem. This is how you use the API.

    Sample API Request

    "http://api.dataweave.in/v1/telecom_data/listByCircle/?api_key=b20a79e582ee4953ceccf41ac28aa08d&operator=Airtel&circle=Karnataka&page=1&per_page=10"
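    For illustration, here is a minimal sketch of issuing the same request from Python with the requests library; the endpoint, key, and parameters are taken verbatim from the sample URL above, and the response structure is simply whatever JSON the API returns:

    import requests

    # parameters lifted from the sample request above
    params = {
        'api_key': 'b20a79e582ee4953ceccf41ac28aa08d',
        'operator': 'Airtel',
        'circle': 'Karnataka',
        'page': 1,
        'per_page': 10,
    }

    response = requests.get('http://api.dataweave.in/v1/telecom_data/listByCircle/', params=params)
    plans = response.json()  # the endpoint returns JSON
    print(plans)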

    Sample API Output

    We aggregate plans from the various cellular service providers across all circles in India on a daily basis. One of our customers once mentioned that earlier they used to aggregate this data manually, and it used to take them about a month to do this. With our API, we have reduced the refresh cycle to one day.

    In addition, now that this process is automated, they can be confident that the data they present to their customers is almost always complete as well as accurate.

    Want to try it out for your business? Talk to us! If you are a developer who wants to use this or any other APIs, we let you use them for free. Just sign up and get your API key.

    DataWeave helps businesses make data-driven decisions by providing relevant actionable data. The company aggregates and organizes data from the web, such that businesses can access millions of data points through APIs, dashboards, and visualizations.

  • Implementing API for Social Data Analysis

    Implementing API for Social Data Analysis

    In today’s world, the analysis of any social media stream can reap invaluable information about, well, pretty much everything. If you are a business catering to a large number of consumers, it is a very important tool for understanding and analyzing the market’s perception about you, and how your audience reacts to whatever you present before them.

    At DataWeave, we sat down to create a setup that would do this for some e-commerce stores and retail brands. And the first social network we decided to track was the micro-blogging giant, Twitter. Twitter is a great medium for engaging with your audience. It’s also a very efficient marketing channel to reach out to a large number of people.

    Data Collection

    The very first issue that needs to be tackled is collecting the data itself. Now quite understandably, Twitter protects its data vigorously. However, it does have a pretty solid REST API for data distribution purposes too. The API is simple, nothing too complex, and returns data in the easy to use JSON format. Take a look at the timeline API, for example. That’s quite straightforward and has a lot of detailed information.

    The issue with the Twitter API, however, is that it is seriously rate limited: every function can be called between 15 and 180 times in a 15-minute window. While this is good enough for small projects that don’t need much data, for any real-world application these rate limits can be really frustrating. To avoid this, we used the Streaming API, which creates a long-lived HTTP GET request that continuously streams tweets from the public timeline.

    Also, Twitter seems to suddenly return null values in the middle of the stream, which can make the streamer crash if we don’t take care. As for us, we simply threw away all null data before it reached the analysis phase, and as an added precaution, designed a simple e-mail alert for when the streamer crashed.
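    The streamer itself isn't reproduced in this post, but a minimal sketch of the idea, assuming the tweepy library (3.x, where StreamListener is available) and valid Twitter credentials, looks something like this; the keywords, credentials, and alerting are placeholders, not our production code:

    import tweepy

    # placeholder credentials; real values come from the Twitter developer console
    auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
    auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')

    class BrandListener(tweepy.StreamListener):
        def on_status(self, status):
            # drop null-ish data here so it never reaches the analysis phase
            if status.text:
                print(status.text)

        def on_error(self, status_code):
            # returning False disconnects; a real setup would also fire the e-mail alert
            return False

    stream = tweepy.Stream(auth=auth, listener=BrandListener())
    stream.filter(track=['flipkart', 'myntra'])  # example brand keywords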

    Data Storage

    Next is data storage. Data is traditionally stored in tables, using an RDBMS. But for this, we decided to use MongoDB, as a document store seemed quite suitable for our needs. While I didn’t have much of a clue about MongoDB or what purpose it was going to serve at first, I soon realized that it is a seriously good alternative to MySQL, PostgreSQL, and other relational, schema-based data stores for a lot of applications.

    Some of the advantages I found out very soon were: a document-based data model that is very easy to handle (analogous to Python dictionaries), and support for expressive queries. I recommend trying it for some of your DB projects. You can play around with it here.

    Data Processing

    Next comes data processing. While data processing in MongoDB is simple, it can be hard to pick up at first, especially for someone like me who had no experience outside SQL. But MongoDB queries are easy to learn once the basics are clear.

    For example, in a DB DWSocial with a collection tweets, the syntax for getting all tweets would be something like this in a Python environment:

    rt = list(db.tweets.find())

    The list() call here is necessary because, without it, the output is simply a MongoDB cursor reference with no materialized values. Now, to find all tweets where user_id is 1234, we have

    rt = list(db.tweets.find({'user_id': 1234}))
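    For context, here is a rough sketch of the setup these snippets assume; the DB name DWSocial and the collection names come from this post, while the connection details are assumptions:

    import pymongo

    client = pymongo.MongoClient()   # assumes a MongoDB instance on localhost, default port
    db = client['DWSocial']          # DB name mentioned above
    tweets = db['tweets']            # collection holding the streamed tweets

    all_tweets = list(tweets.find())                      # every tweet, materialized into a list
    user_tweets = list(tweets.find({'user_id': 1234}))    # tweets from a single user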

    Apart from this, we used regexes to detect specific types of tweets, for example, whether they were about “offers”, “discounts”, or “deals”. For this, we used Python’s re library. Suffice it to say, my reaction to regexes for the first two days was one of utter confusion.

    Once again, it’s just initial stumbles. After some (okay, quite some) help from Thothadri, Murthy, and Jyotiska, I finally managed a basic parser that could detect which tweets were offers, discounts, and deals. A small code snippet for this is below.

    def deal(id):
        # pattern for offer/discount/deal keywords, rupee/dollar amounts, and percentages
        re_offers = re.compile(r'''
            \b
            (?:
                deals?
                |
                offers?
                |
                discount
                |
                promotion
                |
                sale
                |
                rs?
                |
                rs\?
                |
                inr\s*([\d\.,])+
                |
                ([\d\.,])+\s*inr
            )
            \b
            |
            \b\d+%
            |
            \$\d+\b
            ''', re.I | re.X)

        # tweets by this user from the last fourteen days
        x = list(tweets.find({'user_id': id, 'created_at': {'$gte': fourteen_days_ago}}))
        mylist = []
        for a in x:
            b = re_offers.findall(a.get('text'))
            if b:
                print a.get('id')
                mylist.append(a.get('id'))
                # attach the retweet count if we have one for this tweet
                w = list(db.retweets.find({'id': a.get('id')}))
                if w:
                    mydict = {'id': a.get('id'), 'rt_count': w[0].get('rt_count'), 'text': a.get('text'), 'terms': b}
                else:
                    mydict = {'id': a.get('id'), 'rt_count': 0, 'text': a.get('text'), 'terms': b}
                track.insert(mydict)
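    With this in place, deal() can be called once per tracked account id, for instance deal(57947109) for the Flipkart handle listed in the id_legend mapping further below, and every matching tweet gets recorded in the track collection along with its retweet count.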

    This is much less complicated than it seems. And it also brings us to our final step: integrating all our queries into a RESTful API.

    Data Serving

    For this, multiple web frameworks are available. The ones we considered were Flask, Django, and Bottle.

    Weighing the pros and cons of every framework can be tedious. I did find an awesome presentation on SlideShare, though, that succinctly summarizes each framework. You can go through it here.

    We finally settled on Bottle as our choice of framework. The reasons are simple. Bottle is monolithic, i.e., it uses the one-file approach. For small applications, this makes for code that is easier to read and maintain.

    Some sample web address routes are shown here:

    # show all tracked accounts
    id_legend = {57947109: 'Flipkart', 183093247: 'HomeShop18', 89443197: 'Myntra', 431336956: 'Jabong'}

    @route('/ids')
    def get_ids():
        result = json.dumps(id_legend)
        return result

    # show all user mentions for a particular account
    @route('/user_mentions')
    def user_mention():
        m = request.query.id
        ac_id = int(m)
        # mentions: recent tweets that are not retweets and not authored by the account itself
        t = list(tweets.find({'created_at': {'$gte': fourteen_days_ago}, 'retweeted': 'no', 'user_id': {'$ne': ac_id}}))
        a = len(t)
        mylist = []
        for i in t:
            mylist.append({i.get('user_id'): i.get('id')})
        x = {'num_of_mentions': a, 'mentions_details': mylist}
        result = json.dumps(x)
        return result
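    For completeness, here is a rough sketch of the imports and server start-up these routes need; the host and port are assumptions, since they aren't specified in this post:

    from bottle import route, request, run
    import json

    # ... route definitions as above ...

    run(host='localhost', port=8080)  # assumed host/port for local development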

    This is how the DataWeave Social API came into being. I had a great time doing this, with special credit to Sanket, Mandar, and Murthy for all the help they gave me. That’s all for now, folks!

  • Difference Between Json, Ultrajson, & Simplejson | DataWeave

    Difference Between Json, Ultrajson, & Simplejson | DataWeave

    Without argument, one of the most commonly used data models is JSON. There are two popular packages used for handling JSON: the first is the stock json package that comes with the default installation of Python, and the other is simplejson, an optimized and well-maintained JSON package for Python. The goal of this blog post is to introduce ultrajson, or UltraJSON, a JSON library written mostly in C and built to be extremely fast.

    We have benchmarked three popular operations: load, loads, and dumps. We have a dictionary with 3 keys: id, name, and address. We dump this dictionary using json.dumps() and store it in a file, then use json.loads() and json.load() separately to load the dictionaries back from the file. We performed this experiment on 10,000, 50,000, 100,000, 200,000, and 1,000,000 dictionaries and observed how much time each library takes to perform the operation.
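    The exact benchmarking script isn't part of this post, but a rough sketch of the dumps timing, using the record structure described above and an assumed count of 100,000 dictionaries, could look like this (ultrajson is imported as ujson; the field values are placeholders):

    import time
    import json
    import simplejson
    import ujson  # the ultrajson package

    record = {'id': 1, 'name': 'DataWeave', 'address': 'Bangalore'}  # placeholder values

    for lib in (json, simplejson, ujson):
        start = time.time()
        for _ in range(100000):
            lib.dumps(record)
        print('%s dumps: %.3f seconds' % (lib.__name__, time.time() - start))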

    DUMPS OPERATION LINE BY LINE

    Here is the result we received using the json.dumps() operations. We have dumped the content dictionary by dictionary.

     

    We notice that json performs better than simplejson, but ultrajson wins the game with an almost 4x speedup over stock json.

    DUMPS OPERATION (ALL DICTIONARIES AT ONCE)

    In this experiment, we have stored all the dictionaries in a list and dumped the list using json.dumps().

    simplejson is almost as good as stock json, but again ultrajson outperforms both with more than a 60% speedup. Now let’s see how they perform for the load and loads operations.

    LOAD OPERATION ON A LIST OF DICTIONARIES

    Now we do the load operation on a list of dictionaries and compare the results.

    Surprisingly, simplejson beats the other two, with ultrajson coming very close. Here, we observe that simplejson is almost 4 times faster than stock json, and the same holds for ultrajson.

    LOADS OPERATION ON DICTIONARIES

    In this experiment, we load dictionaries from the file one by one and pass them to the json.loads() function.

    Again ultrajson steals the show, being almost 6 times faster than stock json and 4 times faster than simplejson.

    That is all the benchmarks we have here. The verdict is pretty clear: use simplejson instead of stock json in any case, since simplejson is a well-maintained package. If you really want something extremely fast, go for ultrajson. In that case, keep in mind that ultrajson only works with well-defined collections and will not work for un-serializable collections. But if you are dealing with text, this should not be a problem.

     

    This post originally appeared here.