Category: Data Engineering

  • Maximizing Competitive Match Rates: The Foundation of Effective Price Intelligence

    Maximizing Competitive Match Rates: The Foundation of Effective Price Intelligence

    Merchants make countless pricing decisions every day. Whether you’re a brand selling online, a traditional brick-and-mortar retailer, or another seller attempting to navigate the vast world of commerce, figuring out the most effective price intelligence strategy is essential. Having your plan in place will help you price your products in the sweet spot that enhances your price image and maximizes profits.

    For the best chance of success, your overall pricing strategy must include competitive intelligence.
    Many retailers focus their efforts on just collecting the data. But that’s only a portion of the puzzle. The real value lies in match accuracy and knowing exactly which competitor products to compare against. In this article, we will dive deeper into cutting-edge approaches that combine the traditional matching techniques you already leverage with AI to improve your match rates dramatically.
    If you’re a pricing director, category manager, commercial leader, or anyone else who deals with pricing intelligence, this article will help you understand why competitive match rates matter and how you can improve yours.

    Change your mindset from tactical to strategic and see the benefits in your bottom line.

    The Match Rate Challenge

    To the layman, tracking and comparing prices against the competition seems easy. Just match up two products and see which ones are the same! In reality, it’s much more challenging. There are thousands of products to discover, analyze, and compare, and many of those comparisons are subjective. Not only that, product catalogs across the market are constantly evolving and growing, so keeping up becomes a war of attrition with your competitors.

    Let’s put it into focus. Imagine you’re trying to price a 12-pack of Coca-Cola. This is a well-known product that, hypothetically, should be easy to identify across the web. However, every retailer uses their own description in their listing. Some examples include:

    How product names differ on websites - Amazon Example
    Why matching products is a challenge - Naming conventions on Target
    Match Rate Challenge - how product names differ on retailers - Walmart
    • Retailer A lists it as “Coca-Cola 12 Fl. Oz 12 Pack”
    • Retailer B shows “Coca Cola Classic Soda Pop Fridge Pack, 12 Fl. Oz Cans, 12-Pack”
    • Retailer C has “Coca-Cola Soda – 12pk/12 fl oz Cans”

    While a human can easily deduce that these are the same product, the automated system you probably have in place right now is most likely struggling. It cannot see past each retailer’s unique naming conventions, including brand name, description, bundle, unit count, special characters, and sizing, to recognize that all three listings refer to the same item.
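    To make the problem concrete, here is a minimal, hypothetical sketch of the kind of normalization an automated matcher has to perform before these listings can be compared. It is an illustration only, not DataWeave’s pipeline, and it only handles the patterns in this one example.

    ```python
    import re

    def normalize_listing(title: str) -> tuple:
        """Collapse a retailer-specific product title into a comparable key.

        A toy illustration only: real matching engines handle far more
        variation (brands, abbreviations, bundles, multipacks, typos).
        """
        t = title.lower()
        t = re.sub(r"[^\w\s./-]", " ", t)          # strip punctuation such as the dash in "– 12pk"
        t = t.replace("coca cola", "coca-cola")     # unify brand spelling

        brand = "coca-cola" if "coca-cola" in t else None

        # unit size, e.g. "12 Fl. Oz", "12 fl oz"
        size = re.search(r"(\d+(?:\.\d+)?)\s*(?:fl\.?\s*)?oz", t)
        unit_oz = float(size.group(1)) if size else None

        # pack count, e.g. "12 Pack", "12-Pack", "12pk"
        pack = re.search(r"(\d+)\s*-?\s*(?:pack|pk)", t)
        pack_count = int(pack.group(1)) if pack else 1

        return (brand, unit_oz, pack_count)

    titles = [
        "Coca-Cola 12 Fl. Oz 12 Pack",
        "Coca Cola Classic Soda Pop Fridge Pack, 12 Fl. Oz Cans, 12-Pack",
        "Coca-Cola Soda – 12pk/12 fl oz Cans",
    ]

    keys = {normalize_listing(t) for t in titles}
    print(keys)  # all three listings collapse to {('coca-cola', 12.0, 12)}
    ```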

    This has real-world business impacts if your tools cannot accurately compare the price of a Coca-Cola 12-pack across the market.

    Why Match Rates Matter

    If your competitive match rates are poor, you aren’t seeing the whole picture and are either overcharging, undercharging, or reacting to market shifts too slowly.

    Overcharging can result in lost sales, while undercharging may result in out-of-stock due to spikes in demand you haven’t accounted for. Both are recipes to lose out on potential revenue, disappoint customers, and drive business to your competitors.

    What you need is a sophisticated matching capability that can handle the tracking of millions of competitive prices each week. It needs to be able to compare using hundreds of possible permutations, something that is impossible for pricing teams to do manually, especially at scale. With technology making these connections for you, you stop missing out on essential competitive intelligence.

    The Business Impact

    Beyond bottom-line savings, accurately matching competitor products for pricing intelligence has other business impacts. Adding technology to your workflow to improve match rates helps eliminate blind spots, improve decision quality, and increase operational efficiency.

    • Pricing Blind Spots
      • Missing competitor prices on key products
      • Inability to detect competitive threats
      • Delayed response to market changes
    • Decision Quality
      • Incomplete competitive coverage leads to suboptimal pricing
      • Risk of pricing decisions based on wrong product comparisons
    • Operational Efficiency
      • Manual verification costs
      • Time spent reconciling mismatched products
      • Resources needed to maintain price position

    Current Industry Challenges

    The #1 reason businesses like yours probably aren’t already finding the most accurate matches is that not all sites carry comparable product codes. If every listing had a consistent product code, it would be very easy to match that code against your own catalog. In fact, most retailers currently achieve only 60-70% match rates using traditional methods.

    Different product naming conventions, constantly changing product catalogs, and regional product variations contribute to the industry challenges, not to mention the difficulty of finding brand equivalencies and private label comparisons across the competition. So, if you’re struggling, just know everyone else is as well. However, there is a significant opportunity to get ahead of your competition if you can improve your match rates with technology.

    The Matching Hierarchy

    • Direct Code Matching: There are a number of ways to start finding matches across the market. The base tier of the hierarchy, and the most accurate approach, is direct code matching. Most likely, your team already has a process in place that can compare UPC to UPC, for example. When no standard codes are listed, your team is left with a blind spot. This poses limitations in modern retail but is an essential first step for capturing the “low-hanging fruit” and getting the easy matches.
    • Non-Code-Based Matching: The next level of the hierarchy is implementing non-code-based matching strategies. This covers cases where there are no UPCs, DPCIs, ASINs, or other known codes that make it easy to do one-to-one comparisons. These tools can analyze signals like direct size comparisons, unique product descriptions, and features to find more accurate matches. They can look deep into the listing to extract data points beyond a code, even going as far as analyzing images and video content to help find matches. Advanced competitive matching technologies give pricing teams comparison metrics beyond code-based ones.
    • Private Label Conversions: Up to this point in the hierarchy, everything has relied on direct comparisons. Matching identical codes, features, and names works well for one-to-one comparisons, but when there is no identical product to compare against for pricing intelligence, things get more complicated. The third tier of the matching hierarchy is the ability to find matches for ‘like’ products. This can be used for private label conversions and to create meaningful comparisons without direct matches.
    • Similar Size Mappings: This final rung on the matching hierarchy adds another layer of advanced calculations to the comparison capability. Often, retailers and merchants list a product with different sizing values. One may choose to bundle products, break packs apart to sell as single items, or offer a special-sized product manufactured just for them.
    Similar Size Mappings - product matching hierarchy - Walmart
    Similar Size Mappings - product matching hierarchy - Costco

    While the underlying product is ultimately the same, unusual size permutations can make the similarity hard to identify. Technology can help with value size relationships, package variation handling, size equalization, and unit normalization.
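    As a simple illustration of size equalization and unit normalization, the sketch below converts offers with different pack configurations to a common price per fluid ounce before comparing them. The quantities, prices, and conversion factor are purely illustrative.

    ```python
    # Minimal sketch: convert each offer to a common base unit (fluid ounces)
    # and compare price per unit rather than sticker price.

    OZ_PER_LITER = 33.814  # rough conversion, used for illustration only

    def price_per_oz(price: float, unit_size: float, unit: str, pack_count: int = 1) -> float:
        """Normalize an offer to price per fluid ounce."""
        if unit == "l":
            unit_size *= OZ_PER_LITER
        elif unit != "oz":
            raise ValueError(f"unsupported unit: {unit}")
        total_oz = unit_size * pack_count
        return price / total_oz

    offers = [
        ("Retailer A, 12 x 12 oz cans", price_per_oz(6.99, 12, "oz", pack_count=12)),
        ("Retailer B, 2 L bottle",      price_per_oz(2.49, 2, "l")),
        ("Retailer C, 6 x 16.9 oz",     price_per_oz(4.99, 16.9, "oz", pack_count=6)),
    ]

    # Cheapest per-ounce offer first, regardless of how each retailer packages it.
    for label, ppo in sorted(offers, key=lambda o: o[1]):
        print(f"{label}: ${ppo:.3f} per oz")
    ```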

    The AI Advantage

    AI is the natural solution for efficiently executing competitive product matching at scale. DataWeave offers solutions for pricing teams to help them reach over 95% product match accuracy. The tools leverage the most modern Natural Language Processing models for ingesting and analyzing product descriptions. Image recognition capabilities apply methods such as object detection, background removal, and image quality enhancement to focus on an individual product’s key features to improve match accuracy.

    Deep learning models have been trained on years of data to perform pattern recognition in product attributes and to learn from historical matches. All of these capabilities, and others, automate the attribute matching process, from code to image to feature description, to help pricing teams build the most accurate profile of products across the market for highly accurate pricing intelligence.

    Implementation Strategy

    We understand that moving away from manual product comparison methods can be challenging. Every organization is different, but some fundamental steps can be followed for success when leveling up your pricing teams’ workflow.

    1. First, conduct a baseline assessment. Figure out where you are on the matching hierarchy. Are you still only doing direct code-based comparisons? Has your team branched out to compare other non-code-based identifiers?
    2. Next, establish clear match rate targets. If your current match rate sits at the 60-70% industry norm, set a target well above it and break the gap into achievable milestones across the different stages of the implementation process.
    3. Work with your vendor on quality control processes. It may be worth running your current process in tandem to be able to calculate the improvements in real time. With a veteran technology provider like DataWeave, you can rely on the most cutting-edge technology combined with human-in-the-loop checks and balances and a team of knowledgeable support personnel. Additionally, for teams wanting direct control, DataWeave’s Approve/Disapprove Module lets your team review and validate match recommendations before they go live, maintaining full oversight of the matching process.
    4. Feed the system as much product data as possible: the more data about your products it has, the better your match rates will be. DataWeave’s competitive intelligence tools also come with a built-in continuous improvement framework. Part of this is the human element that continually ensures high-quality matches; another is the AI’s ‘learning’ capability. Every time the AI is exposed to a new scenario, it learns for the next time.
    5. Finally, ensure cross-functional alignment. Everyone from the C-suite down should be able to access the synthesized information relevant to their role without sifting through complex data. Customized dashboards and reports can help with this process.

    Future-Proofing Match Rates

    The world of retail is constantly evolving. If you don’t keep up, you’re going to be left behind. There are emerging retail channels, like TikTok Shop, and new product identification methods to leverage, like image comparisons. As more products and retailers enter the market, scaling needs to be part of the plan; it’s impossible to keep up with manual processes. Instead, think about maximizing your match rates every week and not letting them degrade over time. A combination of scale, timely action, and highly accurate match rates will help you price your products the most competitively.

    Key Takeaways

    Match rates are the foundation of pricing intelligence. You can evaluate how advanced your match rate strategy is based on the matching hierarchy. If you’re still early in your journey, you’re likely still relying on code-to-code matches. However, using a mix of AI and traditional methods, you can achieve a 95% accuracy rate on product matching, leading to overall higher competitive match rates. As a result, with continuous improvement, you will stay ahead of the competition even as the goalposts change and new variables are introduced to the competitive landscape.

    Starting this process to add AI to your pricing strategy can be overwhelming. At DataWeave, we work with you to make the change easy. Talk to us today to learn more.

  • From Raw Data to Retail Pricing Intelligence: Transforming Competitive Data into Strategic Assets

    From Raw Data to Retail Pricing Intelligence: Transforming Competitive Data into Strategic Assets

    Poor retail data is the bane of Chief Commercial Officers and VPs of Pricing. If you don’t have the correct inputs or enough of them in real time, you can’t make data-driven business decisions regarding pricing.

    Retail data isn’t limited to your product assortment. Price data from your competition is as important as understanding your brand hierarchies and value size progressions. However, the vast and expanding nature of e-commerce means new competitors are around every corner, creating more raw data for your teams.

    Think of competitive price data like crude oil. Crude or unrefined oil is an extremely valuable and sought-after commodity. But in its raw form, crude oil is relatively useless. Simply having it doesn’t benefit the owner. It must be transformed into refined oil before it can be used as fuel. This is the same for competitive data that hasn’t been transformed. Your competitive data needs to be refined into an accurate, consistent, and actionable form to power strategic insights.

    So, how can retailers transform vast amounts of competitive pricing data into actionable business intelligence? Read this article to find out.

    Poor Data Refinement vs. Good Refinement

    Let’s consider a new product launch, a scenario that affects sellers across most industries, as an example of poor price data refinement vs. good data refinement.

    Retailer A

    Imagine you’re launching a limited-edition sneaker. Sneakerheads online have highly anticipated the launch, and you know your competitors are watching you closely as go-live looms.

    Now, imagine that your pricing data is outdated and unrefined when you go to price your new sneakers. You base your pricing assumptions on last year’s historical data and don’t have a way to account for real-time competitor movements. You price your new product the same as last year’s limited-edition sneaker.

    Your competitor, having learned from last year, anticipates your new product’s price and has a sale lined up to go live mid-launch that undercuts you. Your team discovers this a week later and reacts with a markdown on the new product, fearing demand will lessen without action.

    Customers who have already bought the much-anticipated sneakers feel like they’ve been overcharged now, and backlash on social media is swift. New buyers see the price reduction as proof that your sneakers aren’t popular, and demand decreases. This hurts your brand’s reputation, and the product launch is not deemed a success.

    Retailer B

    Imagine your company had refined competitive data to work with before launch. Your team can see trends in competitors’ promotional activity and notice that a line of sneakers at a major competitor is overdue for a sale. Your team can anticipate that the competitor is planning to lower prices during your launch week in the hope of undercutting you.

    Instead of needing to react retroactively with a markdown, your team comes up with clever ways to bundle accessories with a ‘deal’ during launch week to create value beyond just the price. During launch week, your competitor’s sneakers look like the lesser option while your new sneakers look like the premium choice while still being a good value. Customer loyalty improves, and buzz on social media is positive.

    Here, we can see that refined data drives better decision-making and competitive advantage. It is the missing link in retail price intelligence and can set you ahead of the competition. However, turning raw competitive data into strategic insights is easier said than done. To achieve intelligence from truly refined competitive pricing data, pricing teams need to rely on technology.

    The Hidden Cost of Unrefined Data

    Technology is advancing rapidly, and more sellers are leveraging competitive pricing intelligence tools to make strategic pricing decisions. Retailers that continue to rely on old, manual pricing methods will soon be left behind.

    You might consider your competitive data process to be quite extensive. Perhaps you are successfully gathering vast data about your competitors. But simply having the raw data is just as ineffective as having access to crude oil and making no plan to refine it. Collection alone isn’t enough—you need to transform it into a usable state.

    Attempting to harmonize data using spreadsheets will waste time and give you only limited insights, which are often out of date by the time they’re discovered. Trying to crunch inflexible data will set your team up for failure and impact business decision quality.

    The Two Pillars of Data Refinement

    There are two foundational pillars in data refinement. Neither can truly be achieved manually, even with great effort.

    Competitive Matches

    There are always new sellers and new products being launched in the market. Competitive matching is the process of finding all these equivalent products across the web and tying them together with your products. It goes beyond matching UPCs to link identical products together. Instead, it involves matching products with similar features and characteristics, just as a shopper might compare two similar products on the shelf. For instance, shoppers compare private label brands against legacy brands to discern value.

    A retailer using refined competitive matches can quickly and confidently adjust its prices during a promotional event, know where to increase prices in response to demand and availability, and stay attractive to price-sensitive shoppers without eroding margins.

    Internal Portfolio Matches

    Product matching is a combination of algorithmic and manual techniques that work to recognize and link identical products. This can even be done internally across your product portfolio. Retailers selling thousands or even hundreds of thousands of products know the challenge of consistently pricing items with varying levels of similarity or uniformity. If you must sell a 12oz bottle of shampoo for $3.00 based on its costs, then a 16oz bottle of the same product should not sell for $2.75, even if that aligns with the competition.

    Establishing a process for internal portfolio matching helps to eliminate inefficiencies caused by duplicated or misaligned product data. Instead of discovering discrepancies and having to fire-fight them one by one, an internal portfolio matching feature can help teams preempt this issue.
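    As a simple illustration of the shampoo example above, the sketch below checks a product family for value-size conflicts, i.e., a larger size selling for less than a smaller one. The SKUs and prices are made up for demonstration.

    ```python
    # Toy consistency check for a product family: within one family, a larger
    # size should not carry a lower shelf price than a smaller one.

    family = [
        {"sku": "SHMP-12OZ", "size_oz": 12, "price": 3.00},
        {"sku": "SHMP-16OZ", "size_oz": 16, "price": 2.75},
        {"sku": "SHMP-32OZ", "size_oz": 32, "price": 5.50},
    ]

    def find_value_size_conflicts(items):
        """Flag pairs where a bigger size sells for less than a smaller one."""
        conflicts = []
        ordered = sorted(items, key=lambda i: i["size_oz"])
        for small, large in zip(ordered, ordered[1:]):
            if large["price"] < small["price"]:
                conflicts.append((small["sku"], large["sku"]))
        return conflicts

    print(find_value_size_conflicts(family))
    # [('SHMP-12OZ', 'SHMP-16OZ')] -> the 16oz at $2.75 undercuts the 12oz at $3.00
    ```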

    Leveraging AI for Enhanced Match Rates

    As product SKUs proliferate and new sellers enter the market at lightning speed, you need to scale without hiring dozens more pricing experts. That’s where AI comes in. Not only can AI do the job of dozens of experts, but it also does it in a fraction of the time and at an improved match accuracy rate.

    DataWeave’s AI-powered pricing intelligence and price monitoring offerings help retailers uncover gaps and opportunities to stay competitive in the dynamic world of e-commerce. It can gather competitive data from across the market and accurately match competitor products with internal catalogs. It can also internally match your product portfolio, identifying product family trees and setting tolerances to avoid pricing mismatches. The AI synthesizes all this data and links products into a usable format. Teams can easily access reports and dashboards to get their questions answered without manually attempting to refine the data first.

    How AI helps convert raw data to pricing and assortment intelligence

    From Refinement to Business Value

    Refined competitive price data is your team’s foundation to execute these essential pricing functions: price management, price reporting, and competitive intelligence.

    Price Management

    Refined data is the core of accurate price management and product portfolio optimization. Imagine you’re an electronics seller offering a range of laptops and personal computing devices marketed toward college students. Without refined competitive data, you might fail to account for pricing differences based on regionality for similar products. Demand might be greater in one city than in another. By monitoring your competition, you can match your forecasted demand assumptions with competitor pricing trends to better manage your prices and even offer a greater assortment where there is more demand.

    Price Reporting

    Leadership is always looking for new and better market positioning opportunities. This often revolves around how products are priced, whether you’re making a profit, and where. To effectively communicate across departments and with leadership, pricing teams need a convenient way to report on pricing and make changes or updates as new ad hoc requests come through. Spending hours constructing a report on static data will feel like a waste when the C-Suite asks for it again next week but with current metrics. Refined, constantly updated price data nips this problem in the bud.

    Competitive Intelligence

    Unrefined data can’t be used to discover competitive intelligence accurately. You might miss a new player, fail to account for a new competitive product line, or be unable to extract insights quickly enough to be helpful. This can lead to missed opportunities and misinformed strategies. As a seller, your competitive intelligence should be able to fuel predictive scenario modeling. For example, you should be able to anticipate competitor price changes based on seasonal trends. Your outputs will be wrong without the correct inputs.

    Implementation Framework

    As a pricing leader, you can take these steps to begin evaluating your current process and improve your strategy.

    • Assess your current data quality: Determine whether your team is aggregating data across the entire competitive landscape. Ask yourself if all attributes, features, regionality, and other metrics are captured in a single usable format for your analysts to leverage.
    • Set refinement objectives: If your competitive data isn’t refined, what are your objectives? Do you want to be able to match similar products or product families within your product portfolio?
    • Measure success through KPIs: Establish a set of KPIs to keep you on track. Measure things like match rate accuracy, how quickly you can react to price changes, assortment overlaps, and price parity (a small sketch of such KPI calculations follows this list).
    • Build cross-functional alignment: Create dashboards and establish methods to build ad hoc reports for external departments. Start the conversation with data to build trust across teams and improve the business.
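    For teams that track these KPIs in-house, a minimal sketch like the one below can serve as a starting point. The table, column names, and values are hypothetical, not a prescribed schema.

    ```python
    import pandas as pd

    # Hypothetical matched-products table: one row per SKU in your catalog,
    # with the best competitor match (if any) and both prices.
    df = pd.DataFrame({
        "sku":        ["A1", "A2", "A3", "A4"],
        "matched":    [True, True, False, True],
        "our_price":  [9.99, 24.00, 5.49, 15.00],
        "comp_price": [9.49, 24.00, None, 16.50],
    })

    match_rate = df["matched"].mean()                 # share of SKUs with a competitor match
    matched = df[df["matched"]]
    price_parity = (matched["our_price"] <= matched["comp_price"]).mean()  # share at or below competitor
    avg_gap_pct = ((matched["our_price"] - matched["comp_price"])
                   / matched["comp_price"]).mean() * 100

    print(f"match rate: {match_rate:.0%}, price parity: {price_parity:.0%}, avg gap: {avg_gap_pct:+.1f}%")
    ```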

    What’s Next?

    The time is now to start evaluating your current data refinement process to improve your ability to capture and leverage competitive intelligence. Work with a specialized partner like DataWeave to refine your competitive pricing data using AI and dedicated human-in-the-loop support.

    Want help getting started refining your data fast? Talk to us to get a demo today!

  • How AI Can Drive Superior Data Quality and Coverage in Competitive Insights for Retailers and Brands

    How AI Can Drive Superior Data Quality and Coverage in Competitive Insights for Retailers and Brands

    Managing the endlessly growing competitive data from across your eCommerce landscape can feel like pushing a boulder uphill. The sheer volume can be overwhelming, and ensuring that the data meets high standards of accuracy and quality, and that the insights drawn from it are actionable, is a constant challenge.

    This article explores the challenges eCommerce companies face in having sustained access to high-quality competitive data and how AI-driven solutions like DataWeave empower brands and retailers with reliable, comprehensive, and timely market intelligence.

    The Data Quality Challenge for Retailers and Brands

    Brands and retailers make innumerable daily business decisions relying on accurate competitive and market data. Pricing changes, catalog expansion, development of new products, and where to go to market are just a few. However, these decisions are only as good as the insights derived from the data. If the data is made up of inaccurate or low-quality inputs, the outputs will also be low-quality.

    Managing eCommerce data at scale gets more complex every year. There are more market entrants, retailers, and copy-cats trying to sell similar or knock-off products. There are millions of SKUs from thousands of retailers in multiple markets. Not only that, the data is constantly changing: Amazon may add a new subcategory definition in an existing space, Staples might branch out into a new category like “snack foods for the office”, an established brand might introduce new sizing options in its apparel, or shrinkflation might shrink a product’s size.

    Given this, conventional data collection and validation methods need to be revised. Teams that rely on spreadsheets and manual auditing processes can’t keep up with the scale and speed of change. An algorithm that once matched products easily needs to be updated when trends, categories, or terminology change.

    With SKU proliferation, visually matching product images against the competition becomes impossible, and with so many new sellers in the market, it’s hard even to know where to look for comprehensive data. Luckily, technology has advanced to a place where manual intervention is no longer the main course of action.

    Advanced AI capabilities, like DataWeave’s, tackle these challenges to help gather, categorize, and extract insights that drive impactful business decisions. They perform the millions of actions that your team can’t, with greater accuracy and in near real-time.

    Improving the Accuracy of Product Matching

    Image Matching for Data Quality

    DataWeave’s product matching capabilities rely on an ensemble of text and image-based models with built-in loss functions to determine confidence levels in all insights. These loss functions measure precision and recall. They help in determining how accurate – both in terms of correctness and completeness – the results are so the system can learn and improve over time. The solution’s built-in scoring function provides a confidence metric that brands and retailers can rely on.

    The product matching engine is configurable based on the type of products that we are matching. It uses a “pipelined mode” that first focuses on recall or coverage by maximizing the search space for viable candidates, followed by mechanisms to improve the precision.
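    DataWeave’s engine itself is proprietary, but the pipelined idea can be illustrated generically: a first stage casts a wide net for candidates (recall), and a stricter second stage filters them on key attributes (precision). The sketch below is a toy version of that pattern with invented data.

    ```python
    def tokens(title: str) -> set:
        return set(title.lower().replace("-", " ").split())

    def recall_stage(query: str, catalog: list, min_overlap: float = 0.2) -> list:
        """Stage 1: maximize coverage. Keep any candidate that shares even a
        modest fraction of tokens with the query (high recall, low precision)."""
        q = tokens(query)
        out = []
        for item in catalog:
            c = tokens(item["title"])
            overlap = len(q & c) / max(len(q | c), 1)
            if overlap >= min_overlap:
                out.append((overlap, item))
        return out

    def precision_stage(query_attrs: dict, candidates: list) -> list:
        """Stage 2: tighten up. Require key attributes (brand, size) to agree
        before a candidate is accepted as a match."""
        accepted = []
        for overlap, item in candidates:
            if (item["brand"] == query_attrs["brand"]
                    and item["size_oz"] == query_attrs["size_oz"]):
                accepted.append((overlap, item))
        return sorted(accepted, reverse=True, key=lambda x: x[0])

    catalog = [
        {"title": "Coca Cola Classic 12 Fl Oz Cans 12 Pack", "brand": "coca-cola", "size_oz": 12},
        {"title": "Pepsi Cola 12 oz 12 pack",                "brand": "pepsi",     "size_oz": 12},
    ]
    cands = recall_stage("Coca-Cola 12 Fl. Oz 12 Pack", catalog)   # both candidates survive
    print(precision_stage({"brand": "coca-cola", "size_oz": 12}, cands))  # only the true match remains
    ```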

    How ‘Embeddings’ Enhance Scoring

    Embeddings are like digital fingerprints. They are dense vector representations that capture the essence of a product in a way that makes it easy to identify similar products. With embeddings, we can codify a more nuanced understanding of the varied relationships between different products. Techniques used to create good embeddings are generic and flexible and work well across product categories. This makes it easier to find similarities across products even with complex terminology, attributes, and semantics.
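    As a generic illustration of how embeddings surface similarity that keyword matching misses, the sketch below uses the open-source sentence-transformers library and model; DataWeave’s own embedding models and scoring functions are proprietary, so treat this purely as a conceptual example.

    ```python
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose text encoder

    titles = [
        "Smartphone with a 6.5-inch display, 128GB, Pink",
        "Phone with a 6.5-inch screen, 128 GB storage, Rose Gold",
        "Stainless steel kitchen knife set, 6 pieces",
    ]

    # Dense vectors ("digital fingerprints"); normalizing makes dot product = cosine.
    embeddings = model.encode(titles, normalize_embeddings=True)
    scores = util.cos_sim(embeddings, embeddings)

    print(scores[0, 1].item())  # high: the same product described differently
    print(scores[0, 2].item())  # low: an unrelated product
    ```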

    These embeddings, along with the advanced scoring mechanisms used across DataWeave’s eCommerce offerings, provide the foundation for:

    • Semantic Analysis: Embeddings identify subtle patterns and meanings in text and image data to better align with business contexts.
    • Multimodal Integration: A comprehensive representation of each SKU is created by incorporating embeddings from both text (product descriptions) and images or videos (product visuals)
    • Anomaly Detection: AI models leverage embeddings to identify outliers and inconsistencies to improve the overall score accuracy.
    DataWeave's AI Tech Stack

    Vector Databases for Enhanced Accuracy

    Vector databases play a central role in DataWeave’s AI ecosystem. These databases help with better storage, retrieval, and scoring of embeddings and serve to power real-time applications such as Verification. This process helps pinpoint the closest matches for products, attributes, or categories with the help of similarity algorithms. It can even operate when there is incomplete or noisy data. After identification, the system prioritizes data that exhibits high semantic alignment so that all recommendations are high-quality and relevant.
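    The general mechanism behind this kind of vector retrieval can be sketched with an open-source library such as FAISS: index the catalog embeddings once, then pull back the nearest neighbors of any query embedding. FAISS is used here only as a stand-in and says nothing about DataWeave’s internal stack; the vectors are random placeholders.

    ```python
    import numpy as np
    import faiss

    d = 384                                  # embedding dimension (e.g. MiniLM)
    rng = np.random.default_rng(0)

    # Pretend these are normalized product embeddings produced upstream.
    catalog_vecs = rng.normal(size=(10_000, d)).astype("float32")
    faiss.normalize_L2(catalog_vecs)

    index = faiss.IndexFlatIP(d)             # inner product == cosine on unit vectors
    index.add(catalog_vecs)

    query = rng.normal(size=(1, d)).astype("float32")
    faiss.normalize_L2(query)

    scores, ids = index.search(query, 5)     # top-5 closest catalog products
    print(ids[0], scores[0])
    ```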

    Evolution of Embeddings and Scoring: A Multimodal Perspective

    Product listings undergo daily visual and text changes. DataWeave takes a multimodal approach in its AI to ensure that any content shown on a listing is accounted for, including visuals, videos, contextual signals, and text. DataWeave is continually evolving its embedding and scoring models to align with industry advancements and always works within an up-to-date context.

    DataWeave’s AI framework can:

    • Handle Diverse Data Types: The framework captures a holistic view of the digital shelf by integrating insights from multiple sources.
    • Improve Matching Precision: Sophisticated scoring methods refine the accuracy of matches so that brands and retailers can trust the competitive intelligence.
    • Scale Across Markets: DataWeave’s capabilities handle additional, expansive datasets with ease, meaning brands and retailers can scale across markets without pausing.
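    To illustrate the multimodal idea, the sketch below builds a single SKU representation from both its title and its image using the openly available CLIP model from Hugging Face transformers. DataWeave’s own multimodal stack differs; the model choice and the image path are assumptions used purely for demonstration.

    ```python
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    title = "Coca-Cola Classic Soda, 12 Fl Oz Cans, 12-Pack"
    image = Image.open("listing_image.jpg")          # hypothetical product photo

    with torch.no_grad():
        text_inputs = processor(text=[title], return_tensors="pt", padding=True)
        image_inputs = processor(images=image, return_tensors="pt")
        text_vec = model.get_text_features(**text_inputs)
        image_vec = model.get_image_features(**image_inputs)

    # L2-normalize each modality, then concatenate into one SKU fingerprint.
    text_vec = text_vec / text_vec.norm(dim=-1, keepdim=True)
    image_vec = image_vec / image_vec.norm(dim=-1, keepdim=True)
    sku_embedding = torch.cat([text_vec, image_vec], dim=-1)
    print(sku_embedding.shape)  # (1, 1024) -> 512 text dims + 512 image dims
    ```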

    Quantified Improvements: Model Accuracy and Stats

    • Since we deployed LLMs and CLIP Embeddings, Product Matching accuracy improved by > 15% from the previous baseline numbers in categories such as Home Improvement, Fashion, and CPG.
    • High precision, upwards of 85%, in certain categories such as Electronics and Fashion.
    • Close to 90% of matches are auto-processed (auto-verified or auto-rejected).
    • Attribute tagging accuracy > 75%, with significant improvement for the top 5 categories.

    Business Use Case: Multimodal Matching for Price Leadership

    For example, if you’re a retailer selling consumer electronics, you probably want to maintain your price leadership across your key markets during peak times like Black Friday and Cyber Monday. Doing so is a challenge, as all your competitors are changing prices several times a day to steal your sales. To get ahead of them, you could use DataWeave’s multimodal embedding-based scoring framework to:

    • Detect Discrepancies: Isolate SKUs with price mismatches with your competition and take action before revenue is lost.
    • Optimize Coverage: Establish a process to capture complete data across the competition so you can avoid knowledge gaps.
    • Enable Timely Decisions: Address the ‘low-hanging fruit’ first by leveraging confidence scores to prioritize pricing adjustments on high-impact products.

    This approach helps retailers stay competitive even as eCommerce evolves around us. By acting fast on complete and reliable data, they can earn and sustain their competitive advantage.

    DataWeave’s AI-Driven Data Quality Framework

    Let’s look at how our AI can gather the most comprehensive data and output the highest-quality insights. Our framework evaluates three critical dimensions:

    • Accuracy: “Is my data correct?” – Ensuring reliable product matches and attribute tracking
    • Coverage: “Do I have the complete picture?” – Maintaining comprehensive market visibility
    • Freshness: “Is my data recent?” – Guaranteeing timely and current market insights
    The 3 pillars to gauge data quality at DataWeave

    Scoring Data Quality

    To maintain the highest levels of data quality, we rely on a robust scoring mechanism across our solutions. Every dataset is evaluated against several key parameters, including the accuracy, consistency, timeliness, and completeness of the data. Scores are dynamically updated as new data flows in so that insights can be acted upon (a simple scoring sketch follows the list below).

    • Accuracy: Compare gathered data with multiple trusted sources to reduce discrepancies.
    • Consistency: Detect and rectify variations or contradictions across the data with regular audits.
    • Timeliness: Scoring emphasizes data recency, especially for fast-changing markets like eCommerce.
    • Completeness: Ensure all essential data points are included and gaps in coverage are highlighted by analysis.
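    The sketch below shows one way such a composite score could be computed. The weights, thresholds, and field names are invented for illustration and are not DataWeave’s actual formula.

    ```python
    from datetime import datetime, timezone

    def quality_score(record: dict, required_fields: tuple, now: datetime) -> float:
        """Illustrative 0-1 score blending the four parameters above.
        Weights and thresholds here are arbitrary, not DataWeave's formula."""
        # Completeness: share of required fields that are actually populated.
        completeness = sum(record.get(f) is not None for f in required_fields) / len(required_fields)

        # Timeliness: decay linearly to 0 over 48 hours since last observation.
        age_hours = (now - record["observed_at"]).total_seconds() / 3600
        timeliness = max(0.0, 1 - age_hours / 48)

        # Accuracy proxy: agreement of the scraped price with a second source.
        accuracy = 1.0 if record.get("price") == record.get("price_alt_source") else 0.5

        # Consistency proxy: no contradiction between stock flag and price presence.
        consistency = 0.0 if (record["in_stock"] and record.get("price") is None) else 1.0

        weights = {"completeness": 0.25, "timeliness": 0.25, "accuracy": 0.3, "consistency": 0.2}
        return (weights["completeness"] * completeness + weights["timeliness"] * timeliness
                + weights["accuracy"] * accuracy + weights["consistency"] * consistency)

    record = {
        "title": "Example SKU", "price": 9.99, "price_alt_source": 9.99,
        "in_stock": True, "observed_at": datetime(2024, 1, 1, 6, tzinfo=timezone.utc),
    }
    print(quality_score(record, ("title", "price", "in_stock"),
                        datetime(2024, 1, 1, 18, tzinfo=timezone.utc)))
    ```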

    Apart from this, we also leverage an evolved quality check framework:

    DataWeave's Data Quality Check framework

    Statistical Process Control

    DataWeave implements a sophisticated system of statistical process control that includes:

    • Anomaly Detection: Using advanced statistical techniques to identify and flag outlier data, particularly for price and stock variations (a minimal sketch of this check follows the list)
    • Intelligent Alerting: Automated system for notifying stakeholders of significant deviations
    • Continuous Monitoring: Real-time tracking of data patterns and trends
    • Error Correction: Systematic approach to addressing and rectifying data discrepancies
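    A common statistical-process-control style check for price outliers is a robust, median-based z-score, sketched below. The threshold and sample data are illustrative defaults, not DataWeave settings.

    ```python
    import numpy as np

    def flag_price_anomalies(prices: list, threshold: float = 3.5) -> list:
        """Flag observations far from the median using a robust (MAD-based)
        z-score, a common statistical-process-control style check."""
        x = np.asarray(prices, dtype=float)
        median = np.median(x)
        mad = np.median(np.abs(x - median)) or 1e-9       # avoid divide-by-zero
        robust_z = 0.6745 * (x - median) / mad
        return [i for i, z in enumerate(robust_z) if abs(z) > threshold]

    # A week of scraped prices for one SKU; the $3.99 reading is likely a
    # scraping error or a flash misprice and should trigger an alert.
    history = [19.99, 19.99, 18.99, 19.49, 3.99, 19.99, 19.79]
    print(flag_price_anomalies(history))   # -> [4]
    ```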

    Transparent Quality Assurance

    The platform provides complete visibility into data quality through:

    • Comprehensive Data Transparency & Statistics Dashboard: Offering detailed insights into match performance and data freshness
    • Match Distribution Analysis: Tracking both exact and similar matches across retailers and locations as required
    • Product Tracking Metrics: Visibility into the number of products being monitored
    • Autonomous Audit Mechanisms: Giving customers access to cached product pages for transparent, on-demand verification

    Human-in-the-Loop Validation (Véracité)

    DataWeave’s Véracité system combines AI capabilities with human expertise to ensure unmatched accuracy:

    • Expert Validation: Product category specialists who understand industry-specific similarity criteria
    • Continuous Learning: AI models that evolve through ongoing expert feedback
    • Adaptive Matching: Recognition that similarity criteria can vary by category and change over time
    • Detailed Documentation: Comprehensive reasoning for product match decisions

    Together, these elements create a robust framework that delivers accurate, complete, and relevant product data for competitive intelligence. The system’s combination of automated monitoring, statistical validation, and human expertise ensures businesses can make decisions based on reliable, high-quality data.

    In Conclusion

    DataWeave’s AI-driven approach to data quality and coverage empowers retailers and brands to navigate the complexities of eCommerce with confidence. By leveraging advanced techniques such as multimodal embeddings, vector databases, and advanced scoring functions, businesses can ensure accurate, comprehensive, and timely competitive intelligence. These capabilities enable them to optimize pricing, improve product visibility, and stay ahead in an ever-evolving market. As AI continues to refine product matching and data validation processes, brands can rely on DataWeave’s technology to eliminate inefficiencies and drive smarter, more profitable decisions.

    The evolution of AI in competitive intelligence is not just about automation—it’s about precision, scalability, and adaptability. DataWeave’s commitment to high data quality standards, supported by statistical process controls, transparent validation mechanisms, and human-in-the-loop expertise, ensures that insights remain actionable and trustworthy. In a digital landscape where data accuracy directly impacts profitability, investing in AI-powered solutions like DataWeave’s is not just an advantage—it’s a necessity for sustained eCommerce success.

    To learn more, reach out to us today or email us at contact@dataweave.com.

  • Enterprise Data Security at DataWeave: Empowering Smarter Decisions with Seamless, Secure Data Management and Integration

    Enterprise Data Security at DataWeave: Empowering Smarter Decisions with Seamless, Secure Data Management and Integration

    At DataWeave, data security isn’t just about compliance—it’s about enabling peace of mind and better decision-making for our customers. Our customers rely on us not just for competitive and market intelligence but also for the seamless integration of critical data sources into their decision-making frameworks. To achieve this, we have built a security-first infrastructure that ensures organizations can confidently leverage both external and internal data without compromising privacy or protection.

    Secure Data Integration: The Foundation of Smarter Decisions

    Effective decision-making in today’s digital commerce landscape depends on combining multiple data sources—including first-party customer data, pricing intelligence, and business rules—into a unified framework. However, without the right security measures in place, businesses often struggle to operationalize this data effectively.

    At DataWeave, we eliminate this challenge by offering:

    • Integration with Leading Data Storage Solutions: Our platform seamlessly connects with data lakes and warehouses like AWS S3 and Snowflake, ensuring that businesses can easily ingest and analyze our data in real time (a minimal ingestion sketch follows this list).
    DataWeave's Data Security Framework
    • Support for Sandboxed Environments & Data Clean Rooms: Organizations can securely merge internal and external datasets without compromising confidentiality, unlocking deeper insights for pricing and business strategies.
    • Automated Data Ingestion & Management: We simplify the process of integrating first-party data alongside competitive insights, allowing customers to focus on execution rather than infrastructure management.
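    For teams wiring this up themselves, the sketch below shows the general shape of pulling a daily export from S3 into a DataFrame. The bucket name, key layout, and columns are hypothetical placeholders, not DataWeave’s actual feed specification.

    ```python
    import io
    from datetime import date

    import boto3
    import pandas as pd

    s3 = boto3.client("s3")

    bucket = "example-dataweave-exports"                       # hypothetical bucket
    key = f"pricing/daily/{date.today():%Y/%m/%d}/matches.csv" # hypothetical key layout

    obj = s3.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(io.BytesIO(obj["Body"].read()))

    # From here the frame can be loaded into a warehouse table (Snowflake,
    # Redshift, etc.) or joined with first-party sales data in a clean room.
    print(df.head())
    ```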

    Our Purpose-Built Security Framework

    Handling millions of data points daily demands a security framework that is not only robust but also scalable and adaptable to evolving threats. DataWeave’s multi-tenant architecture ensures seamless data security without compromising operational efficiency.

    • Multi-Tenant Architecture: Our system allows multiple customers to share the same application infrastructure while maintaining complete data isolation and security.
      • Tenants share infrastructure and computing resources but remain logically isolated.
      • Application-level controls ensure privacy while maximizing cost efficiency.
      • Centralized updates, maintenance, and easy scalability for new tenants.
    • End-to-End Encryption & Access Controls: Every piece of data is encrypted both in transit and at rest. Role-based access controls (RBAC) restrict visibility to only authorized personnel, ensuring minimal risk of unauthorized data access.

    • Active Monitoring & Automated Compliance Management: We leverage automated access controls that adjust permissions dynamically as organizational roles evolve, ensuring that compliance is continuously maintained.
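    Conceptually, the tenant isolation and role-based controls described above boil down to scoping every read to the caller’s tenant and role, as in the generic sketch below. This illustrates the pattern only; it is not DataWeave’s implementation.

    ```python
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class User:
        user_id: str
        tenant_id: str
        role: str          # e.g. "analyst", "admin", "viewer"

    PRICE_READ_ROLES = {"analyst", "admin"}

    def fetch_competitor_prices(user: User, sku: str, table: list) -> list:
        """Return price rows for a SKU, visible only within the caller's tenant."""
        if user.role not in PRICE_READ_ROLES:
            raise PermissionError(f"role '{user.role}' may not read pricing data")
        # Tenant scoping is applied on every read, never left to the caller.
        return [row for row in table
                if row["tenant_id"] == user.tenant_id and row["sku"] == sku]

    table = [
        {"tenant_id": "t1", "sku": "SKU-1", "competitor": "A", "price": 9.99},
        {"tenant_id": "t2", "sku": "SKU-1", "competitor": "A", "price": 9.49},
    ]
    alice = User("alice", "t1", "analyst")
    print(fetch_competitor_prices(alice, "SKU-1", table))  # only tenant t1's row is visible
    ```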

    Certifications That Inspire Confidence

    Data security is at the core of everything we do. Our compliance with the highest industry standards ensures that businesses can trust us with their sensitive data.

    SOC 2 Type II Certification: DataWeave’s SOC 2 compliance is a testament to our commitment to stringent security protocols. This certification guarantees that we adhere to strict standards in data protection, availability, and confidentiality.

    We implement a phased approach to security improvement:

    • Prioritizing Critical Systems: To maximize impact, we prioritized systems that had the highest data security relevance and expanded the coverage thereafter. By addressing these priority areas, we were able to make meaningful security improvements early in the process.
    • Automating Monitoring and Compliance: Partnering with Sprinto streamlined the compliance journey by automating key processes. This included real-time monitoring of our cloud environments, automated generation of audit-ready evidence, and integration with critical systems like AWS, Bitbucket, and Jira. These enhancements ensured efficient management of compliance requirements while reducing the burden on our teams.
    SOC 2 Compliance at DataWeave
    • Fostering a Culture of Shared Responsibility: We conducted organization-wide training sessions to embed compliance as a shared responsibility across all teams. By educating employees on the importance of security practices and providing them with the tools to manage compliance autonomously, we established a security-first mindset throughout the company.

    This systematic method allowed us to deliver immediate improvements while aligning long-term practices with the industry’s best standards.

    What This Means for Our Customers

    By combining robust security with seamless data integration, DataWeave empowers businesses to:

    • Optimize Price Management & Modelling: With secure access to real-time data, organizations can make informed pricing decisions that enhance profitability and market competitiveness.
    • Run Advanced Simulations & Testing: Reliable, secure data enables businesses to model various pricing and assortment strategies before implementation, reducing risks and maximizing returns.
    • Uncompromised Data Security: SOC 2 Type II compliance ensures stringent protocols to protect your data at every stage.
    • Simplified Vendor Processes: Verified security certifications reduce friction during due diligence and onboarding, making it easier to partner with us.
    • Aligned Standards: Our adherence to industry benchmarks reflects our commitment to meeting your expectations as a trusted technology partner.
    • Scalable Operations: Expand across regions while maintaining full confidence in data privacy and security.
    • Secure Collaboration: Share insights across teams with tools designed to protect sensitive information.

    Our customers are increasingly looking to integrate their internal datasets with the external competitive intelligence provided by DataWeave. This can be a complex and risky process without the right security measures in place. We remove these roadblocks by providing a secure, scalable infrastructure designed to help businesses unify data without security concerns.

    By ensuring seamless compatibility with key data storage platforms, such as Snowflake and AWS S3, we enable organizations to consolidate valuable first-party data with timely market insights. This integration empowers businesses to refine their pricing, assortment, and digital shelf strategies, thereby driving superior customer experiences—without the headaches of data security risks.

    Security remains a top priority in everything we do. Our SOC 2 Type II-certified framework enforces rigorous encryption, access controls, and real-time compliance monitoring. We take on the burden of data security so our customers can focus on innovation and growth.

    With DataWeave, businesses can confidently leverage secure data-driven decision-making to unlock new opportunities, optimize operations, and scale without compromise.

    To learn more, write to us at contact@dataweave.com or request a consultation here.

  • Redefining Product Attribute Tagging With AI-Powered Retail Domain Language Models

    Redefining Product Attribute Tagging With AI-Powered Retail Domain Language Models

    In online retail, success hinges on more than just offering quality products at competitive prices. As eCommerce catalogs expand and consumer expectations soar, businesses face an increasingly complex challenge: How do you effectively organize, categorize, and present your vast product assortments in a way that enhances discoverability and drives sales?

    Having complete and correct product catalog data is key. Effective product attribute tagging—a crucial yet frequently undervalued capability—helps in achieving this accuracy and completeness in product catalog data. While traditional methods of tagging product attributes have long struggled with issues of scalability, consistency, accuracy, and speed, fundamentally new ways of addressing these challenges are being established. These follow from the revolution brought about by Large Language Models, but take the form of Small Language Models (SLMs), or more precisely, Domain-Specific Language Models. They can potentially be considered foundational models, as they solve a wide variety of downstream tasks, albeit within specific domains, and they are far more efficient and effective at those tasks than a general-purpose LLM.

    Retail Domain Language Models (RLMs) have the potential to transform the eCommerce customer journey. As always, it’s never a binary choice. In fact, LLMs can be a great starting point since they provide an enhanced semantic understanding of the world at large: they can be used to mine structured information (e.g., product attributes and values) out of unstructured data (e.g., product descriptions), create baseline domain knowledge (e.g., manufacturer-brand mappings), augment information (e.g., image to prompt), and create first-cut training datasets.

    Powered by cutting-edge Generative AI and RLMs, next-generation attribute tagging solutions are transforming how online retailers manage their product catalog data, optimize their assortment, and deliver superior shopping experiences. As a new paradigm in search emerges – based more on intent and outcome, powered by natural language queries and GenAI based Search Agents – the capability to create complete catalog information and rich semantics becomes increasingly critical.

    In this post, we’ll explore the crucial role of attribute tagging in eCommerce, delve into the limitations of conventional tagging methods, and unveil how DataWeave’s innovative AI-driven approach is helping businesses stay ahead in the competitive digital marketplace.

    Why Product Attribute Tagging is Important in eCommerce

    As the eCommerce landscape continues to evolve, the importance of attribute tagging will only grow, making it a pertinent focus for forward-thinking online retailers. By investing in robust attribute tagging systems, businesses can gain a competitive edge through improved product comparisons, more accurate matching, understanding intent, and enhanced customer search experiences.

    Taxonomy Comparison and Assortment Gap Analysis

    Products are categorized and organized differently on different retail websites. Comparing taxonomies helps in understanding focus categories and potential gaps in assortment breadth in relation to one’s competitors: missing product categories, sizes, variants, or brands. It also gives insights into the navigation patterns and information architecture of one’s competitors. This can help make the search and navigation experience more efficient by fine-tuning product descriptions to include more attributes and/or adding additional relevant filters to category listing pages.

    For instance, check out the different Backpack categories on Amazon and Staples in the images below.

    Product Names and Category Names Differ on Different eCommerce Platforms - Here's an Amazon Example
    Product Names and Category Names Differ on Different eCommerce Platforms - Here's a Staples Example

    Or look at the nomenclature of categories for “Pens” on Amazon (left side of the image) and Staples (right side of the image) in the image below.

    Product Names and Category Names Differ on Different eCommerce Platforms -Here's how Staples Vs. Amazon Categories look for Pens

    Assortment Depth Analysis

    Another big challenge in eCommerce is the lack of standardization in retailer taxonomy. This inconsistency makes it difficult to compare the depth of product assortments across different platforms effectively. For instance, to categorize smartphones,

    • Retailer A might organize it under “Electronics > Mobile Phones > Smartphones”
    • Retailer B could use “Technology > Phones & Accessories > Cell Phones”
    • Retailer C might opt for “Consumer Electronics > Smartphones & Tablets”

    Inconsistent nomenclature and grouping create a significant hurdle for businesses trying to gain a competitive edge through assortment analysis. The challenge is exacerbated if you want to analyze assortment depth for one or more product attributes. For instance, look at the image below to get an idea of the several attribute variations for “Desks” on Amazon and Staples.

    With Multiple Attributes Named in a Variety of Ways, Attribute Tagging is Essential to Ensure Accurate Product Matching

    Custom categorization through attribute tagging is essential for conducting granular assortment comparisons, allowing companies to accurately assess their product offerings against those of competitors.

    Enhancing Product Matching Capabilities

    Accurate product matching across different websites is fundamental for competitive pricing intelligence, especially when matching similar and substitute products. Attribute tagging and extraction play a crucial role in this process by narrowing down potential matches more effectively, enabling matching for both exact and similar products, and tagging attributes such as brand, model, color, size, and technical specifications.

    For instance, when choosing to match similar products in the Sofa category for 2-3 seater sofas from Wayfair and Overstock, tagging attributes like brand, color, size, and more is a must for accurate comparisons.

    Attribute Tagging for Home & Furniture Categories Like Sofas Helps Improve Matching Accuracy
    Attribute Tagging for Home & Furniture Categories Like Sofas Helps Improve Matching Accuracy

    Taking a granular approach not only improves pricing strategies but also helps identify gaps in product offerings and opportunities for expansion.

    Fix Content Gaps and Improve Product Detail Page (PDP) Content

    Attribute tagging plays a vital role in enhancing PDP content by ensuring adherence to brand integrity standards and content compliance guidelines across retail platforms. Tagging attributes allows for benchmarking against competitor content, identifying catalog gaps, and enriching listings with precise details.

    This strategic tagging process can highlight missing or incomplete information, enabling targeted optimizations or even complete rewrites of PDP content to improve discoverability and drive conversions. With accurate attribute tagging, businesses can ensure each product page is fully optimized to capture consumer attention and meet retail standards.

    Elevating the Search Experience

    In today’s online retail marketplace, a superior search experience can be the difference between a sale and a lost customer. Through in-depth attribute tagging, vendors can enable more accurate filtering to improve search result relevance and facilitate easier product discovery for consumers.

    By integrating rich product attributes extracted by AI into an in-house search platform, retailers can empower customers with refined and user-friendly search functionality. Enhanced search capabilities not only boost customer satisfaction but also increase the likelihood of conversions by helping shoppers find exactly what they’re looking for more quickly and with minimal effort.

    Pitfalls of Conventional Product Tagging Methods

    Traditional methods of attribute tagging, such as manual and rule-based systems, have been significantly enhanced by the advent of machine learning. While these approaches may have sufficed in the past, they are increasingly proving inadequate in the face of today’s dynamic and expansive online marketplaces.

    Scalability

    As eCommerce catalogs expand to include thousands or even millions of products, the limitations of machine learning and rule-based tagging become glaringly apparent. As new product categories emerge, these systems struggle to keep pace, often requiring extensive revisions to existing tagging structures.

    Inconsistencies and Errors

    Not only is reliance on an entirely human-driven tagging process expensive, but it also introduces a significant margin for error. While machine learning can automate the tagging process, it’s not without its limitations. Errors can occur, particularly when dealing with large and diverse product catalogs.

    As inventories grow more complex to handle diverse product ranges, the likelihood of conflicting or erroneous rules increases. These inconsistencies can result in poor search functionality, inaccurate product matching, and ultimately, a frustrating experience for customers, eroding the benefits of tagging in the first place.

    Speed

    When product information changes or new attributes need to be added, manually updating tags across a large catalog is a time-consuming process. Slow tagging processes make it difficult for businesses to adapt quickly to emerging market trends, cause significant delays in listing new products, and can mean missing crucial market opportunities.

    How DataWeave’s Advanced AI Capabilities Revolutionize Product Tagging

    Advanced solutions leveraging RLMs and Generative AI offer promising alternatives capable of overcoming these challenges and unlocking new levels of efficiency and accuracy in product tagging.

    DataWeave automates product tagging to address many of the pitfalls of other conventional methods. We offer a powerful suite of capabilities that empower businesses to take their product tagging to new heights of accuracy and scalability with our unparalleled expertise.

    Our sophisticated AI system brings an advanced level of intelligence to the tagging process.

    RLMs for Enhanced Semantic Understanding

    Semantic Understanding of Product Descriptions

    RLMs analyze the meaning and context of product descriptions rather than relying on keyword matching.
    Example: “Smartphone with a 6.5-inch display” and “Phone with a 6.5-inch screen” are semantically similar, though phrased differently.

    Attribute Extraction

    RLMs can identify important product attributes (e.g., brand, size, color, model) even from noisy or unstructured data.
    Example: Extracting “Apple” as a brand, “128GB” as storage, and “Pink” as the color from a mixed description.
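    The general pattern here, mining structured attributes out of an unstructured description with a language model and parsing the result, can be sketched as below. DataWeave’s Retail Domain Language Models are proprietary, so the OpenAI SDK, model name, and prompt are stand-ins used purely for illustration.

    ```python
    # Sketch: extract structured attributes from a noisy product description.
    # The model, prompt, and output fields are illustrative assumptions only.
    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    description = "Apple iPhone 15, 128GB storage, Pink, 6.1-inch Super Retina display"

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Extract brand, model, storage, color, and screen_size "
                        "from the product description. Reply with JSON only."},
            {"role": "user", "content": description},
        ],
        response_format={"type": "json_object"},
    )

    attributes = json.loads(response.choices[0].message.content)
    print(attributes)  # e.g. {"brand": "Apple", "model": "iPhone 15", "storage": "128GB", ...}
    ```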

    Identifying Implicit Relationships

    RLMs find implicit relationships between products that traditional rule-based systems miss.
    Example: Recognizing that “iPhone 12 Pro” and “Apple iPhone 12” are part of the same product family.

    Synonym Recognition in Product Descriptions

    Synonym Matching with Context

    RLMs identify when different words or phrases describe the same product.
    Examples: “Sneakers” = “Running Shoes”, “Memory” = “RAM” (in electronics)
    Even subtle differences in wording, like “rose gold” vs “pink” are interpreted correctly.

    Overcoming Brand-Specific Terminology

    Some brands use their own terminologies (e.g., “Retina Display” for Apple).
    RLMs can map proprietary terms to more generic ones (e.g., Retina Display = High-Resolution Display).

    Dealing with Ambiguities

    RLMs analyze surrounding text to resolve ambiguities in product descriptions.
    Example: Resolving “charger” to mean a “phone charger” when matched with mobile phones.

    Contextual Understanding for Improved Accuracy and Precision

    By leveraging advanced natural language processing (NLP), DataWeave’s AI can process and understand the context of lengthy product descriptions and customer reviews, minimizing errors that often arise at human touch points. The solution processes and interprets listings to extract key details, dramatically improving the overall accuracy of product tags.

    It excels at grasping the subtle differences between similar products, sizes, and colors, identifying and tagging minute differences between items and ensuring that each product is uniquely and accurately represented in a retailer’s catalog.

    This has a major impact on product and similarity-based matching that can even help optimize similar and substitute product matching to enhance consumer search. At the same time, our AI can understand that the same term might have different meanings in various product categories, adapting its tagging approach based on the specific context of each item.

    This deep comprehension ensures that even nuanced product attributes are accurately captured and tagged for easy discoverability by consumers.

    Case Study: Niche Jewelry Attributes

    DataWeave’s advanced AI can assist in labeling the subtle attributes of jewelry by analyzing product images and generating prompts to describe the image. In this example, our AI identifies the unique shapes and materials of each item in the prompts.

    The RLM can then extract key attributes from the prompt to generate tags. This assists in accurate product matching for searches as well as enhanced product recommendations based on similarities.
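    As a simplified illustration of that extraction step, the snippet below pulls attribute tags out of a generated image description using plain keyword lookups. The vocabularies and function names are assumptions for illustration; in practice the attributes are extracted by learned models rather than fixed lists.

```python
# Simplified sketch: extracting jewelry attribute tags from an AI-generated
# image description using keyword lookups. Vocabularies are illustrative only.
SHAPES = {"round", "oval", "heart", "teardrop", "square"}
MATERIALS = {"rose gold", "platinum", "diamond", "silver", "pearl", "gold"}

def extract_tags(description: str) -> dict:
    text = description.lower()
    found_materials = []
    for m in sorted(MATERIALS, key=len, reverse=True):  # match longer phrases first
        if m in text and not any(m in f for f in found_materials):
            found_materials.append(m)
    return {
        "shape": sorted(s for s in SHAPES if s in text),
        "material": found_materials,
    }

caption = "A pair of teardrop earrings in rose gold with small pearl accents."
print(extract_tags(caption))
# {'shape': ['teardrop'], 'material': ['rose gold', 'pearl']}
```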

    DataWeave's AI assists in extracting contextual attributes for accuracy in product matching

    This multi-model approach provides the flexibility to adapt as product catalogs expand while remaining consistent with tagging to yield more robust results for consumers.

    Unparalleled Scalability

    DataWeave can rapidly scale tagging for new categories. The solution is built to handle the demands of even the largest eCommerce catalogs, enabling:

    • Effortless management of extensive product catalogs: We can process and tag millions of products without compromising on speed or accuracy, allowing businesses to scale without limitations.
    • Automated bulk tagging: New product lines or entire categories can be tagged automatically, significantly reducing the time and resources required for catalog expansion.

    Normalizing Size and Color in Fashion

    Style, color, and size are the core attributes in the fashion and apparel categories. Style attributes, which include design, appearance, and overall aesthetics, can be highly specific to individual product categories.

    Normalizing Size and Color in Fashion for Product Matching

    Our product matching engine can easily handle color and sizing complexity via our AI-driven approach combined with human verification. By leveraging advanced technology to identify and normalize identical and similar products from competitors, you can optimize your pricing strategy and product assortment to remain competitive. Using Generative AI in normalizing color and size in fashion is key to powering competitive pricing intelligence at DataWeave.

    Continuous Adaptation and Learning

    Our solution evolves with your business, improving continuously through feedback and customization for retailers’ specific product categories. The system can be fine-tuned to understand and apply specialized tagging for niche or industry-specific product categories. This ensures that tags remain relevant and accurate across diverse catalogs and as trends emerge.

    The AI in our platform also continuously learns from user interactions and feedback, refining its tagging algorithms to improve accuracy over time.

    Stay Ahead of the Competition With Accurate Attribute Tagging

    In the current landscape, the ability to accurately and consistently tag product attributes is no longer a luxury—it’s essential for staying competitive. With advancements in Generative AI, companies like DataWeave are revolutionizing the way product tagging is handled, ensuring that every item in a retailer’s catalog is presented with precision and depth. As shoppers demand a more intuitive, seamless experience, next-generation tagging solutions are empowering businesses to meet these expectations head-on.

    DataWeave’s innovative approach to attribute tagging is more than just a technical improvement; it’s a strategic advantage in an increasingly competitive market. By leveraging AI to scale and automate tagging processes, online retailers can keep pace with expansive product assortments, manage content more effectively, and adapt swiftly to changes in consumer behavior. In doing so, they can maintain a competitive edge.

    To learn more, talk to us today!

  • Normalizing Size and Color in Fashion Using AI to Power Competitive Price Intelligence

    Normalizing Size and Color in Fashion Using AI to Power Competitive Price Intelligence

    Fashion is as dynamic a market as any—and more competitive than most others. Consumer trends and customer needs are always evolving, making it challenging for fashion and apparel brands to keep up.

    Despite the inherent difficulties fashion and apparel sellers face, this industry is one of the largest grossing markets in the world, estimated at $1.79 trillion in 2024. Global revenue for apparel is expected to grow at an annual rate of about 3.3% over the next four years. That means companies in this space stand to make significant revenue if they can competitively price their products, keep up with the competition, and win customer loyalty with consistent product availability.

    There are three main categories in fashion and apparel. These include:

    • Apparel and clothing (i.e., shirts, pants, dresses, and other apparel)
    • Footwear (i.e., sneakers, sandals, heels, and other products)
    • Accessories (i.e., bags, belts, watches, and so on)

    If you look at all of these product types across all sorts of retailers, there is a massive amount of overlapping data based on product attributes like style and size that are difficult to normalize.

    Fashion Attributes

    Style, color, and size are the main attribute categories in fashion and apparel. Style attributes include things like design, look, and overall aesthetics of the product. They’re very dependent on the actual product category of fashion as well. A shirt might have a slim fit attribute associated with it, whereas a belt might have a length. All these different attributes are usually labeled within a product listing and affect the consumer’s decision-making process:

    • Color (red, blue, sea green, etc.)
    • Pattern (solid, striped, checked, floral, etc.)
    • Material (cotton, polyester, leather, denim, silk, etc.)
    • Fit (regular, slim, relaxed, oversized, tailored, etc.)
    • Type (casual, formal, sporty, vintage, streetwear)

    Color Complexity in Fashion

    Color is perhaps the most visually distinctive attribute in fashion, yet it presents unique challenges for retailers. This is because color naming can vary across retailers and marketplaces. There are several major differences in color convention:

    • A single color can be labeled differently across brands (e.g., “navy,” “midnight blue,” “deep blue”)
    • Seasonal color names (e.g., “summer sage” vs. “forest green”)
    • Marketing-driven names (e.g., “sunset coral” vs. “pale orange”)
    Differences in color naming - challenges faced by fashion retail intelligence systems

    Size: The Other Critical Dimension

    Size in fashion refers to the dimensions or measurements that determine how fashion products fit. Depending on whether the product is a clothing item, shoes, or a hat, there will be different sizing options. Types of sizes include:

    • Standard sizes (XS, S, M, L, XL, XXL)
    • Custom sizes (based on brand, retailer, country, etc.)

    A single type of product may have different sizing labels. For instance, one pants listing may use traditional S, M, L, XL sizing, while another may use 24, 25, or 26 to refer to the waist measurement.

    Size Variations - challenges faced by fashion retail intelligence systems

    Size is a dynamic attribute that changes based on current trends. For example, there has recently been a significant shift towards inclusive sizing. Size inclusivity refers to the practice of selling apparel in a wide range of sizes to accommodate people of all body types. Consumers are more aware of this trend and are demanding a broader range of sizing offerings from the brands they shop from.

    In the US market, in particular, some 67% of American women wear a size 14 or above and may be interested in purchasing plus-size clothing. There is a growing demand in the plus-size market for more options and a wider selection. Many brands are considering expanding their sizes to accommodate more shoppers and tap into this growing revenue channel.

    Pricing Based on Size and Color

    Many fashion products are priced differently based on size and color. Let’s take a look at an example of what this can look like.

    Different colors may retail at different price points.

    A popular beauty brand (see image) is known for its viral lip tint. While most of the color variants are priced at $9.90 on Amazon, a specific, less pigmented colorway is priced at $9.57. This price differential is driven by both material costs and market demand.

    Different colorways (the combinations of colors in which a style or design is available) of the same product also often command different prices. This is based on:

    • Dye costs (some colors require more expensive processes)
    • Seasonal demand (traditional colors vs. trend colors)
    • Exclusivity (limited edition colors)

    An example of price variation by size is a women’s shirt sold on Amazon, shown below. For this product, there are no style attributes to choose from; the only parameter the shopper has to select is the size, from S to XL. At the top, we can see that the product in size S is ₹389, while below, the size XL version of the same shirt is ₹399. This price increase corresponds to the change in size.

    Different sizes may retail at different price points.
    Different sizes may retail at different price points.

    So why are these same products priced differently? An analysis of One Six, a plus-size clothing brand, identified several reasons for this price difference in plus-size clothing.

    • Extra material is needed, hence an increase in production costs
    • Extra stitching costs, hence an increase in production costs
    • Production of plus-size clothing often means acquiring specialized machinery
    • Smaller-scale production runs for plus-size clothing mean these initiatives often don’t benefit from economies of scale

    Some sizes are sold more than others, meaning that in-demand sizes for certain apparel can affect pricing as well. Brands want to be able to charge as much as possible for their listing without risking losing a sale to a competitor.

    The Competitive Pricing Challenge: Normalizing Product Attributes Across Competitors in Apparel and Fashion

    There are hundreds of possible attribute permutations for every single apparel product. Some retailers may only sell core sizes and basic colors; some may sell a mix of sizes for multiple style types. Most retailers also sell multiple color variants for all styles in their catalog. Other retailers may only sell a single, in-demand size of the product. Also, when other retailers are selling the product, it’s unlikely that their naming conventions, color options, style options, and sizing match yours one-for-one.

    In one analysis, it was found that there were 800+ unique values for heel sizes and 1,000+ unique values for shirts and tops at a single retailer! If you’re looking to compare prices, the effort involved in setting up and managing lookup tables to identify discrepancies (when one retailer uses European sizes and another uses US sizes, for example) is simply too onerous to contemplate. Colors only add to the complexity, as similar colors may carry different names in different regions and locations as well!

    Even if you managed to find all the discrepancies between product attributes, you would still need to update them any time a competitor changed a convention.

    Still, monitoring your competitors and strategically pricing your listings is essential to maintain and grow market share. So what do you do? You can’t simply eyeball your competitor’s website to check their pricing and naming conventions. Instead, you need advanced algorithms to scan the entire marketplace, identify individual products being sold, and normalize their data and attributes for analysis.

    Getting Color and Size Level Pricing Intelligence

    With DataWeave, size and color are just two of several dimensions of a product instead of an impossible big data problem for teams. Our product matching engine can easily handle color and sizing complexity via our AI-driven approach combined with human verification.

    This works by using AI built on more than 10 years of product catalog data across thousands of retail websites. It matches common identifiers, like UPC, SKU code, and other attributes, for harmonization before employing large language model (LLM) prompts to normalize color variations and sizing to a single standard.

    The data flow DataWeave uses for product sizing and color normalization

    For example, if a competitor lists its smallest size as “Sm” while your smallest listing is identified as “S,” DataWeave can match those two attributes using AI. Similar classification can be performed on color as well.

    Complex LLM prompts are pre-established so that this process is fast and efficient, taking minutes rather than weeks of manual effort.
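    As a simplified illustration of what this normalization achieves, the sketch below maps retailer-specific size labels to one standard scale using a lookup table, with unseen labels left to an LLM-style fallback. The mapping and function names are illustrative assumptions, not DataWeave’s actual prompts.

```python
# Minimal sketch: normalizing retailer-specific size labels to one standard.
# The mapping below is illustrative; DataWeave's production pipeline uses
# pre-established LLM prompts, which this lookup-plus-fallback only approximates.
SIZE_MAP = {
    "s": "S", "sm": "S", "small": "S",
    "m": "M", "med": "M", "medium": "M",
    "l": "L", "lg": "L", "large": "L",
    "xl": "XL", "x-large": "XL", "extra large": "XL",
}

def normalize_size(raw_label: str) -> str:
    key = raw_label.strip().lower()
    if key in SIZE_MAP:
        return SIZE_MAP[key]
    # Fallback: in practice, an LLM prompt would map unseen labels
    # (e.g., numeric waist sizes or regional conventions) to the standard scale.
    return "UNMAPPED"

print(normalize_size("Sm"))  # -> "S"
print(normalize_size("S"))   # -> "S": both listings now share one label
```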

    Harmonizing products along with their color and sizing data across different retailers for further analysis has several benefits. Most importantly, product matching helps teams conduct better competitive analysis, allowing them to stay informed about market trends, competitors’ offerings, and how those competitors are pricing various permutations of the same product. It helps ensure that you’re offering the most competitive assortment of sizing in several colors to win more market share as well. Overall, it’s easier for teams to gain insights and exploit their findings when all the data is clean and available at their fingertips.

    Product Matching Size and Color in Apparel and Fashion

    Color and size are crucial attributes for retailers and brands in the apparel and fashion industry. They add a level of complexity that can’t be overstated. While it’s a necessity to win consumers (more colors and sizes will mean a wider potential reach), the more permutations you add to your listing, the more complicated it will be to track it against your competition. However, this challenge is worth undertaking as long as you have the right solutions at your disposal.

    With a strategy backed by advanced technology to discover identical and similar products across the competitive landscape and normalize their color and sizing attributes, you can ensure that you are competitively pricing your products and offering the best assortment possible. Employing DataWeave’s AI technology to find competitor listings, match products across variants, and track pricing regularly is the way to go.

    Interested in learning more about DataWeave? Click here to get in touch!

  • DataWeave’s AI Evolution: Delivering Greater Value Faster in the Age of AI and LLMs

    DataWeave’s AI Evolution: Delivering Greater Value Faster in the Age of AI and LLMs

    In retail, competition is fierce, and in its ever-evolving landscape, consumer expectations are higher than ever.

    For years, our AI-driven solutions have been the foundation that empowers businesses to sharpen their competitive pricing and optimize digital shelf performance. But in today’s world, evolution is constant—so is innovation. We now find ourselves at the frontier of a new era in AI. With the dawn of Generative AI and the rise of Large Language Models (LLMs), the possibilities for eCommerce companies are expanding at an unprecedented pace.

    These technologies aren’t just a step forward; they’re a leap—propelling our capabilities to new heights. The insights are deeper, the recommendations more precise, and the competitive and market intelligence we provide is sharper than ever. This synergy between our legacy of AI expertise and the advancements of today positions DataWeave to deliver even greater value, thus helping businesses thrive in a fast-paced, data-driven world.

    This article marks the beginning of a series where we will take you through these transformative AI capabilities, each designed to give retailers and brands a competitive edge.

    In this first piece, we’ll offer a snapshot of how DataWeave aggregates and analyzes billions of publicly available data points to help businesses stay agile, informed, and ahead of the curve. These fall into five broad categories:

    • Product Matching
    • Attribute Tagging
    • Content Analysis
    • Promo Banner Analysis
    • Other Specialized Use Cases

    Product Matching

    Dynamic pricing is an indispensable tool for eCommerce stores to remain competitive. A blessing—and a curse—of online shopping is that users can compare prices of similar products in a few clicks, with most shoppers gravitating toward the lowest price. Consequently, retailers can lose sales over minor discrepancies of $1–2 or even less.

    All major eCommerce platforms compare product prices—especially their top selling products—across competing players and adjust prices to match or undercut competitors. A typical product undergoes 20.4 price changes annually, or roughly once every 18 days. Amazon takes it to the extreme, changing prices approximately every 10 minutes. It helps them maintain a healthy price perception among their consumers.

    However, accurate product matching at scale is a prerequisite for the above, and that poses significant challenges. There is no standardized approach to product cataloging, so even identical products bear different product titles, descriptions, and attributes. Information is often incomplete, noisy, or ambiguous. Image data contains even more variability—the same product can be styled using different backgrounds, lighting, orientations, and quality; images can have multiple overlapping objects of interest or extraneous objects, and at times the images and the text on a single page might belong to completely different products!

    DataWeave leverages advanced technologies, including computer vision, natural language processing (NLP), and deep learning, to achieve highly accurate product matching. Our pricing intelligence solution accurately matches products across hundreds of websites and automatically tracks competitor pricing data.

    Here’s how it works:

    Text Preprocessing

    This step identifies the relevant text features essential for accurate comparison.

    • Metadata Parsing: Extracts product titles, descriptions, attributes (e.g., color, size), and other structured data elements from Product Description Pages (PDP) that can help in accurately identifying and classifying products.
    • Attribute-Value Normalization: Normalizes attribute names (e.g., RAM vs. Memory) and their values (e.g., 16 gigabytes vs. 16 gigs vs. 16 GB); normalizes brand names (e.g., Benetton vs. UCB vs. United Colors of Benetton); and maps category hierarchies to a standard taxonomy (see the sketch after this list).
    • Noise Removal: Removes stop words and other elements with no descriptive value; this focuses keyword extraction on meaningful terms that contribute to product identification.
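    Below is a minimal sketch of the attribute-value normalization step, using simple regular-expression rules to collapse storage values and brand aliases into canonical forms. The rules are illustrative assumptions; a production pipeline handles far more variations.

```python
import re

# Minimal sketch of attribute-value normalization (illustrative rules only):
# collapse the many ways storage capacity is written into one canonical form,
# and map brand aliases to a single name.
STORAGE_PATTERN = re.compile(r"(\d+)\s*(?:giga\s*bytes|gigabytes|gigs|gb)\b", re.IGNORECASE)
BRAND_ALIASES = {"ucb": "United Colors of Benetton", "benetton": "United Colors of Benetton"}

def normalize_storage(text: str) -> str:
    # "16 giga bytes", "16 gigs", "16 GB" all become "16GB"
    return STORAGE_PATTERN.sub(lambda m: f"{m.group(1)}GB", text)

def normalize_brand(brand: str) -> str:
    return BRAND_ALIASES.get(brand.strip().lower(), brand.strip())

print(normalize_storage("Phone with 16 giga bytes storage"))  # Phone with 16GB storage
print(normalize_brand("UCB"))                                 # United Colors of Benetton
```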

    Image Preprocessing

    Image processing algorithms use feature extraction to define visual attributes. For example, when comparing images of a red T-shirt, the algorithm might extract features such as “crew neck,” “red,” or “striped.”

    Image Preprocessing using advanced AI and other tech for product matching in retail analytics.

    Image hashing techniques create a unique representation (or “hash”) of an image, allowing for efficient comparison and matching of product images. This process transforms an image into a concise string or sequence of numbers that captures its essential features even if the image has been resized, rotated, or edited.

    Before we perform these activities, images need to be preprocessed to prepare them for downstream operations. These steps include object detection to identify objects of interest, background removal, face/skin detection and removal, pose estimation and correction, and so forth.
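    To illustrate the hashing idea in its simplest form, here is a minimal “average hash” sketch built only with PIL and NumPy. Real pipelines use more robust perceptual hashes and deep visual features; this only shows how a compact, comparable fingerprint can be derived from an image. The file names in the usage comment are hypothetical.

```python
from PIL import Image
import numpy as np

def average_hash(path: str, hash_size: int = 8) -> np.ndarray:
    """Minimal 'average hash' sketch: shrink, grayscale, threshold at the mean.
    Similar images produce hashes that differ in only a few bits."""
    img = Image.open(path).convert("L").resize((hash_size, hash_size))
    pixels = np.asarray(img, dtype=np.float32)
    return (pixels > pixels.mean()).flatten()

def hamming_distance(h1: np.ndarray, h2: np.ndarray) -> int:
    # Lower distance = more similar images (0 = identical hash)
    return int(np.count_nonzero(h1 != h2))

# Illustrative usage with hypothetical file names:
# d = hamming_distance(average_hash("listing_a.jpg"), average_hash("listing_b.jpg"))
# if d <= 5: the two product images are likely near-duplicates
```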

    Embeddings

    We have built a hybrid, multimodal product-matching engine that uses image features, text features, and domain heuristics. For every product we process, we create and store multiple text and image embeddings in a vector database. These range from basic feature vectors (e.g., TF-IDF vectors, color histograms, shape vectors) to embeddings from deep learning models (e.g., BERT, CLIP) to the latest LLM-based embeddings.
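    A toy sketch of how such embeddings can be fused and compared is shown below. The dimensions, weights, and vectors are purely illustrative assumptions; in practice the embeddings come from the models listed above and are searched in a vector database.

```python
import numpy as np

def combine_embeddings(text_emb: np.ndarray, image_emb: np.ndarray,
                       text_weight: float = 0.6) -> np.ndarray:
    """Illustrative fusion of text and image embeddings into one product vector.
    Weights and dimensions are assumptions, not DataWeave's actual configuration."""
    t = text_emb / np.linalg.norm(text_emb)
    i = image_emb / np.linalg.norm(image_emb)
    return np.concatenate([text_weight * t, (1 - text_weight) * i])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dim embeddings standing in for BERT/CLIP/LLM vectors
prod_a = combine_embeddings(np.array([0.2, 0.8, 0.1, 0.4]), np.array([0.5, 0.1, 0.9, 0.2]))
prod_b = combine_embeddings(np.array([0.21, 0.79, 0.12, 0.41]), np.array([0.48, 0.12, 0.9, 0.19]))
print(cosine_similarity(prod_a, prod_b))  # close to 1.0 -> likely the same product
```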

    Classification

    Classification algorithms enhance product attribute tagging by designating match types. For example, the product might be identified as an “exact match”, “variant”, “similar”, or “substitute.” The algorithm can also identify identical product combinations or “baskets” of items typically purchased together.

    What is the Business Impact of Product Matching?

    • Pricing Intelligence: Businesses can strategically adjust pricing to remain competitive while maintaining profitability. High-accuracy price comparisons help businesses analyze their competitive price position, identify opportunities to improve pricing, and reclaim market share from competitors.
    • Similarity-Based Matching: Products are matched based on a range of similarity features, such as product type, color, price range, specific features, etc., leading to more accurate matches.
    • Counterfeit Detection: Businesses can identify counterfeit or unauthorized versions of branded products by comparing them against authentic product listings. This helps safeguard brand identity and enables brands to take legal action against counterfeiters.

    Attribute Tagging

    Attribute tagging involves assigning standardized tags for product attributes, such as brand, model, size, color, or material. These naming conventions form the basis for accurate product matching. Tagging detailed attributes, such as specifications, features, and dimensions, helps match products that meet similar criteria. For example, tags like “collar” or “pockets” for apparel ensure high-fidelity product matches for hard-to-distinguish items with minor stylistic variations.

    Attributes that are tagged when images are matched for retail eCommerce analytics.

    Including tags for synonyms, variants, and long-tail keywords (e.g., “denim” and “jeans”) improves the matching process by recognizing different terms used for similar products. Metadata tags categorize similar items according to SKU numbers, manufacturer details, and other identifiers.

    Altogether, these capabilities provide high-quality product matches and valuable metadata for retailers to classify their products and compare their product assortment to competitors.

    User-Generated Content (UGC) Analysis

    Customer reviews and ratings are rich sources of information, enabling brands to gauge consumer sentiment and identify shortcomings regarding product quality or service delivery. However, while informative, reviews constitute unstructured “noisy” data that is actionable only if parsed correctly.

    Here’s where DataWeave’s UGC analysis capability steps in.

    • Feature Extractor: Automatically pulls specific product attributes mentioned in the review (e.g., “battery life,” “design” and “comfort”)
    • Feature Opinion Pair: Pairs each product attribute with a corresponding sentiment from the review (e.g., “battery life” is “excellent,” “design” is “modern,” and “comfort” is “poor”)
    • Calculate Sentiment: Calculates an overall sentiment score for each product attribute
    The user generated content analysis framework used by DataWeave to calculate sentiment.

    The final output combines the information extracted from each of these features, which looks something like this:

    • Battery life is excellent
    • Design is modern
    • Not satisfied with the comfort

    The algorithm also recognizes spammy reviews and distinguishes subjective reviews (i.e., those fueled by emotion) from objective ones.
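    The toy sketch below illustrates the feature-opinion pairing idea using simple clause splitting and keyword lists. These rules are illustrative assumptions only; DataWeave’s pipeline relies on NLP models rather than fixed vocabularies.

```python
import re

# Toy sketch of feature-opinion pairing (clause splitting + keyword lists only).
FEATURES = ["battery life", "design", "comfort"]
POSITIVE = {"excellent", "great", "modern", "good"}
NEGATIVE = {"poor", "bad", "uncomfortable", "disappointing"}

def feature_opinions(review: str) -> dict:
    pairs = {}
    # Split into rough clauses so each feature is judged by the words around it.
    clauses = re.split(r"[,.;]| and | but ", review.lower())
    for clause in clauses:
        for feature in FEATURES:
            if feature in clause:
                words = set(clause.split())
                if words & POSITIVE:
                    pairs[feature] = "positive"
                elif words & NEGATIVE:
                    pairs[feature] = "negative"
                else:
                    pairs.setdefault(feature, "neutral")
    return pairs

review = "Battery life is excellent and the design feels modern, but comfort is poor."
print(feature_opinions(review))
# {'battery life': 'positive', 'design': 'positive', 'comfort': 'negative'}
```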

    DataWeave's image processing tool also analyses promo banners.

    Promo Banner Analysis

    Our image processing tool can interpret promotional banners and extract information regarding product highlights, discounts, and special offers. This provides insights into pricing strategies and promotional tactics used by other online stores.

    For example, if a competitor offers a 20% discount on a popular product, you can match or exceed this discount to attract more customers.

    The banner reader identifies successful promotional trends and patterns from competitors, such as the timing of discounts, frequently promoted product categories or brands, and the duration of sales events. Ecommerce stores can use this information to optimize their promotion strategies, ensuring they launch compelling and timely offers.
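    As a rough illustration of the underlying mechanics, the sketch below OCRs a banner image and pulls out discount phrases with a regular expression. The file name is hypothetical, and the approach is far simpler than a production banner reader.

```python
import re
from PIL import Image
import pytesseract  # assumes the Tesseract OCR engine is installed locally

def extract_discounts(banner_path: str) -> list[str]:
    """Illustrative sketch: OCR a promo banner image and pull out discount phrases.
    DataWeave's banner reader is more sophisticated; this only shows the idea."""
    text = pytesseract.image_to_string(Image.open(banner_path))
    # Capture phrases like "20% off" or "flat 30% discount"
    return re.findall(r"\b(?:flat\s+)?\d{1,2}%\s*(?:off|discount)?", text, flags=re.IGNORECASE)

# Hypothetical usage:
# print(extract_discounts("competitor_homepage_banner.png"))  # e.g. ['20% off']
```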

    Other Specialized Use Cases

    While these generalized AI tools are highly useful in various industries, we’ve created other category- and attribute-specific capabilities for specialty goods (e.g., those requiring certifications or approval by federal agencies) and food items. These use cases help our customers adhere to compliance requirements.

    Certification Mark Detector

    This detector lets retailers match items based on official certification marks. These marks represent compliance with industry standards, safety regulations, and quality benchmarks.

    Example:

    • USDA Organic: Certification for organic food production and handling
    • ISO 9001: Quality Management System Certification

    By detecting these certification marks, the system can accurately match products with their certified counterparts. By seeing which competitor products are certified, retailers can also spot products that may benefit from certification.

    Image analysis based product matching at DataWeave also detects certificate marks.

    Nutrition Fact Table Reader

    Product attributes alone are insufficient for comparing food items. Differences in nutrition content can influence product category (e.g., “health food” versus regular food items), price point, and consumer choice. DataWeave’s nutrition fact table reader scans nutrition information on packaging, capturing details such as calorie count, macronutrient distribution (proteins, fats, carbohydrates), vitamins, and minerals.

    The solution ensures items with similar nutritional profiles are correctly identified and grouped based on specific dietary requirements or preferences. This helps with price comparisons and enables eCommerce stores to maintain a reliable database of product information and build trust among health-conscious consumers.
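    A simplified sketch of how OCR’d nutrition text might be turned into structured fields is shown below. The field names and patterns are assumptions; real-world tables vary widely by region and format.

```python
import re

# Illustrative sketch: turning OCR'd nutrition-table text into structured fields.
NUTRITION_PATTERNS = {
    "calories": r"calories[^\d]*(\d+)",
    "protein_g": r"protein[^\d]*([\d.]+)\s*g",
    "fat_g": r"total\s+fat[^\d]*([\d.]+)\s*g",
    "carbs_g": r"carbohydrate[^\d]*([\d.]+)\s*g",
}

def parse_nutrition(ocr_text: str) -> dict:
    facts = {}
    text = ocr_text.lower()
    for field, pattern in NUTRITION_PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            facts[field] = float(match.group(1))
    return facts

sample = "Calories 110  Total Fat 0.5 g  Total Carbohydrate 25 g  Protein 2 g"
print(parse_nutrition(sample))
# {'calories': 110.0, 'protein_g': 2.0, 'fat_g': 0.5, 'carbs_g': 25.0}
```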

    Image processing for product matching also extracts nutrition table data at DataWeave.

    Building Next-Generation Competitive and Market Intelligence

    Moving forward, breakthroughs in generative AI and LLMs have fueled substantial innovation, which has enabled us to introduce powerful new capabilities for our customers.

    How Gen AI and LLMs are used by DataWeave to glean insights for analytics

    These include:

    • Building Enhanced Products, Solutions, and Capabilities: Generative AI and LLMs can significantly elevate the performance of existing solutions by improving the accuracy, relevance, and depth of insights. By leveraging these advanced AI technologies, DataWeave can enhance its product offerings, such as pricing intelligence, product matching, and sentiment analysis. These tools will become more intuitive, allowing for real-time updates and deeper contextual understanding. Additionally, AI can help create entirely new solutions tailored to specific use cases, such as automating competitive analysis or identifying emerging market trends. This positions DataWeave to remain at the forefront of innovation, offering cutting-edge solutions that meet the evolving needs of retailers and brands.
    • Reducing Turnaround Time (TAT) to Go-to-Market Faster: Generative AI and LLMs streamline data processing and analysis workflows, enabling faster decision-making. By automating tasks like data aggregation, sentiment analysis, and report generation, AI dramatically reduces the time required to derive actionable insights. This efficiency means that businesses can respond to market changes more swiftly, adjusting pricing or promotional strategies in near real-time. Faster insights translate into reduced turnaround times for product development, testing, and launch cycles, allowing DataWeave to bring new solutions to market quickly and give clients a competitive advantage.
    • Improving Data Quality to Achieve Higher Performance Metrics: AI-driven technologies are exceptionally skilled at cleaning, organizing, and structuring large datasets. Generative AI and LLMs can refine the data input process, reducing errors and ensuring more accurate, high-quality data across all touchpoints. Improved data quality enhances the precision of insights drawn from it, leading to higher performance metrics like better product matching, more accurate price comparisons, and more effective consumer sentiment analysis. With higher-quality data, businesses can make smarter, more informed decisions, resulting in improved revenue, market share, and customer satisfaction.
    • Augmenting Human Bandwidth with AI to Enhance Productivity: Generative AI and LLMs serve as powerful tools that augment human capabilities by automating routine, time-consuming tasks such as data entry, classification, and preliminary analysis. This allows human teams to focus on more strategic, high-value activities like interpreting insights, building relationships with clients, and developing new business strategies. By offloading these repetitive tasks to AI, human productivity is significantly enhanced. Employees can achieve more in less time, increasing overall efficiency and enabling teams to scale their operations without needing a proportional increase in human resources.

    In our ongoing series, we will dive deep into each of these capabilities, exploring how DataWeave leverages cutting-edge AI technologies like Generative AI and LLMs to solve complex challenges for retailers and brands.

    In the meantime, talk to us to learn more!

  • Using Siamese Networks to Power Accurate Product Matching in eCommerce

    Using Siamese Networks to Power Accurate Product Matching in eCommerce

    Retailers often compete on price to gain market share in high performance product categories. Brands too must ensure that their in-demand assortment is competitively priced across retailers. Commerce and digital shelf analytics solutions offer competitive pricing insights at both granular and SKU levels. Central to this intelligence gathering is a vital process: product matching.

    Product matching or product mapping involves associating identical or similar products across diverse online platforms or marketplaces. The matching process leverages the capabilities of Artificial Intelligence (AI) to automatically create connections between various representations of identical or similar products. AI models create groups or clusters of products that are exactly the same or “similar” (based on some objectively defined similarity criteria) to solve different use cases for retailers and consumer brands.

    Accurate product matching offers several key benefits for brands and retailers:

    • Competitive Pricing: By identifying identical products across platforms, businesses can compare prices and adjust their strategies to remain competitive.
    • Market Intelligence: Product matching enables brands to track their products’ performance across various retailers, providing valuable insights into market trends and consumer preferences.
    • Assortment Planning: Retailers can analyze their product range against competitors, identifying gaps or opportunities in their offerings.

    Why Product Matching is Incredibly Hard

    But product matching stands out as one of the most demanding technical processes for commerce intelligence tools. Here’s why:

    Data Complexity

    Product information comes in various (multimodal) formats – text, images, and sometimes video. Each format presents its own set of challenges, from inconsistent naming conventions to varying image quality.

    Data Variance

    The considerable fluctuations in both data quality and quantity across diverse product categories, geographical regions, and websites introduce an additional layer of complexity to the product matching process.

    Industry Specific Nuances

    Industry specific nuances introduce unique challenges to product matching. Exact matching may make sense in certain verticals, such as matching part numbers in industrial equipment or identifying substitute products in pharmaceuticals. But for other industries, exactly matched products may not offer accurate comparisons.

    • In the Fashion and Apparel industry, style-to-style matching, accommodating variants and distinguishing between core sizes and non-core sizes and age groups become essential for accurate results.
    • In Home Improvement, the presence of unbranded products, private labels, and the preference for matching sets rather than individual items complicates the process.
    • On the other hand, for grocery, product matching becomes intricate due to the distinction between item pricing and unit pricing. Managing the diverse landscape of different pack sizes, quantities, and packaging adds further layers of complexity.

    Diverse Downstream Use Cases

    The diverse downstream business applications give rise to various flavors of product matching tailored to meet specific needs and objectives.

    In essence, while product matching is a critical component in eCommerce, its intricacies demand sophisticated solutions that address the above challenges.

    To solve these challenges, at DataWeave, we’ve developed an advanced product matching system using Siamese Networks, a type of machine learning model particularly suited for comparison tasks.

    Siamese Networks for Product Matching

    Our methodology involves the use of ensemble deep learning architectures. In such cases, multiple AI models are trained and used simultaneously to ensure highly accurate matches. These models tackle NLP (natural language processing) and Computer Vision challenges specific to eCommerce. This technology helps us efficiently narrow down millions of product candidates to just 5-15 highly relevant matches.

    The Tech Powering Siamese Networks

    The key to our approach is creating what we call “embeddings” – think of these as unique digital fingerprints for each product. These embeddings are designed to capture the essence of a product in a way that makes similar products easy to identify, even when they look slightly different or have different names.

    Our system learns to create these embeddings by looking at millions of product pairs. It learns to make the embeddings for similar products very close to each other while keeping the embeddings for different products far apart. This process, known as metric learning, allows our system to recognize product similarities without needing to put every product into a rigid category.

    This approach is particularly powerful for eCommerce, where we often need to match products across different websites that might use different names or images for the same item. By focusing on the key features that make each product unique, our system can accurately match products even in challenging situations.

    How Do Siamese Networks Work?

    Imagine having a pair of identical twins who are experts at spotting similarities and differences. That’s essentially what a Siamese network is – a pair of identical AI systems working together to compare things.

    How it works:

    • Twin AI systems: Two identical AI systems look at two different products.
    • Creating ‘fingerprints’ or ‘embedding’: Each system creates a unique ‘fingerprint’ of the product it’s looking at.
    • Comparison: These ‘fingerprints’ are then compared to see how similar the products are.

    Architecture

    The architecture of a Siamese network typically consists of three main components: the shared network, the similarity metric, and the contrastive loss function.

    • Shared Network: This is the ‘brain’ that creates the product ‘fingerprints’ or ‘embeddings.’ It is responsible for extracting meaningful feature representations from the input samples. This network is composed of layers of neural units that work together. Weight sharing between the twin networks ensures that the model learns to extract comparable features for similar inputs, providing a basis for comparison.
    • Similarity Metric: After the shared network processes the inputs, a similarity metric is employed. This decides how alike two ‘fingerprints’ or ‘embeddings’ are. The selection of a similarity metric depends on the specific task and characteristics of the input data. Frequently used similarity metrics include the Euclidean distance, cosine similarity, or correlation coefficient, each chosen based on its suitability for the given context and desired outcomes.
    • Loss Function: For training the Siamese network, a specialized loss function is used. This helps the system improve its comparison skills over time. It guides and trains the network to generate akin embeddings for similar inputs and disparate embeddings for dissimilar inputs.

      This is achieved by imposing penalties on the model when the distance or dissimilarity between similar pairs surpasses a designated threshold, or when the distance between dissimilar pairs falls below another predefined threshold. This training strategy ensures that the network becomes adept at discerning and encoding the desired level of similarity or dissimilarity in its learned embeddings.

    How DataWeave Uses Siamese Networks for Product Matching

    At DataWeave, we use Siamese Networks to match products across different retailer websites. Here’s how it works:

    Pre-processing (Image Preparation)

    • We collect product images from various websites.
    • We clean these images up to make them easier for our AI to understand.
    • We use techniques like cropping, flipping, and adjusting colors to help our AI recognize products even if the images are slightly different.

    Training The AI

    • We show our AI system millions of product images, teaching it to recognize similarities and differences.
    • We use a special learning method called “Triplet Loss” to help our AI understand which products are the same and which are different (a minimal sketch of this idea follows this list).
    • We’ve tested different AI structures to find the one that works best for product matching, including ResNet, EfficientNet, NFNet, and ViT. 
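    Below is a minimal PyTorch sketch of that idea: one shared encoder embeds an anchor image, a positive (same product), and a negative (different product), and a triplet loss pulls matching pairs together while pushing non-matches apart. The architecture, dimensions, and data here are illustrative assumptions, not DataWeave’s production model.

```python
import torch
import torch.nn as nn

# Minimal Siamese/triplet sketch (illustrative architecture only): a single
# shared encoder produces embeddings for anchor, positive, and negative images.
class SharedEncoder(nn.Module):
    def __init__(self, embedding_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
            nn.Flatten(),
            nn.Linear(16 * 8 * 8, embedding_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so distances are comparable across products
        return nn.functional.normalize(self.net(x), dim=1)

encoder = SharedEncoder()
triplet_loss = nn.TripletMarginLoss(margin=1.0)

# Dummy batches standing in for preprocessed product images (3x64x64)
anchor, positive, negative = (torch.randn(4, 3, 64, 64) for _ in range(3))
loss = triplet_loss(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()  # one step's gradients; an optimizer would then update the weights
```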

    Image Retrieval 

    • Once trained, our AI creates a unique “fingerprint” for each product image.
    • We store these fingerprints in a smart database.
    • When we need to find a match for a product, we:
      • Create a fingerprint for the new product.
      • Quickly search our database for the most similar fingerprints.
      • Return the top matching products.

    Matches are then assigned a high or a low similarity score and segregated into “Exact Matches” or “Similar Matches.” For example, check out the image of this white shoe on the left. It has a low similarity score with the pink shoe (below) and so these SKUs are categorized as a “Similar Match.” Meanwhile, the shoe on the right is categorized as an “Exact Match.”
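    Conceptually, this last step is a simple thresholding of similarity scores, as in the sketch below. The 0.9 cutoff and SKU identifiers are illustrative assumptions.

```python
# Illustrative sketch: bucketing retrieval similarity scores into match types.
def classify_matches(candidates, exact_threshold=0.9):
    buckets = {"Exact Match": [], "Similar Match": []}
    for sku, score in candidates:
        label = "Exact Match" if score >= exact_threshold else "Similar Match"
        buckets[label].append((sku, score))
    return buckets

candidates = [("white-sneaker-retailer-b", 0.94), ("pink-sneaker-retailer-b", 0.71)]
print(classify_matches(candidates))
# {'Exact Match': [('white-sneaker-retailer-b', 0.94)],
#  'Similar Match': [('pink-sneaker-retailer-b', 0.71)]}
```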

    Similarly, in the following image of the dress for a young girl, the matched SKU has a high similarity score and so this pair is categorized as an “Exact Match.”

    Siamese Networks play a pivotal role in DataWeave’s Product Matching Engine. Amid the millions of images and product descriptions online, our Siamese Networks act as an equalizing force, efficiently narrowing down millions of candidates to a curated selection of 10-15 potential matches. 

    In addition, these networks also find application in several other contexts at DataWeave. They are used to train our system to understand text-only data from product titles and joint multimodal content from product descriptions.

    Leverage Our AI-Driven Product Matching To Get Insightful Data

    In summary, accurate and efficient product matching is no longer a luxury – it’s a necessity. DataWeave’s advanced product matching solution provides brands and retailers with the tools they need to navigate this complex landscape, turning the challenge of product matching into a competitive advantage.

    By leveraging cutting-edge technology and simplifying it for practical use, we empower businesses to make informed decisions, optimize their operations, and stay ahead in the ever-evolving eCommerce market. To learn more, reach out to us today!

  • How AI-Powered Visual Highlighting Helps Brands Achieve Product Consistency Across eCommerce

    How AI-Powered Visual Highlighting Helps Brands Achieve Product Consistency Across eCommerce

    As eCommerce increasingly becomes a prolific channel of sales for consumer brands, they find that maintaining a consistent and trustworthy brand image is a constant struggle. In an ecosystem filled with dozens of marketplaces and hundreds of third-party merchants, ensuring that customers see what aligns with a brand’s intended image is quite tricky. With many fakes and counterfeit products doing the rounds, brands may further struggle to get the right representation.

    One way brands can track and identify inconsistencies in their brand representation across marketplaces is to use Digital Shelf Analytics solutions like DataWeave’s – specifically the Content Audit module.

    This solution uses advanced AI models to identify image similarities and dissimilarities compared with the original brand image. Brands could then use their PIM platform or work with the retailer to replace inaccurate images.

    But here’s the catch – AI can’t always accurately predict all the differences. Relying solely on scores given by these models poses a challenge in tracking the subtle differences between images. Often, image pairs with seemingly high match scores fail to catch important distinctions. Fake or counterfeit products and variations that slip past the AI’s scrutiny can lead to significant inaccuracies. Ultimately, it puts the reliability of the insights that brands depend on for crucial decisions at risk, impacting both top and bottom lines.

    Dealing with this challenge means finding a balance between the number-based assessments of AI models and the human touch needed for accurate decision-making. However, giving auditors the ability to pinpoint variations precisely goes beyond simply sharing numerical values of the match scores with them. Visualizing model-generated scores is important as it provides human auditors with a tangible and intuitive understanding of the differences between two images. While numerical scores are comparable in the relative sense, they lack specificity. Visual interpretation empowers auditors to identify precisely where variations occur, aiding in efficient decision-making.

    How AI-Powered Image Scoring Works

    At DataWeave, our approach involves employing sophisticated computer vision models to conduct extensive image comparisons. Convolutional Neural Network (CNN) models such as Resnet-50 or YOLO, in conjunction with feature extraction models, analyze images quantitatively. This AI-powered image scoring process yields scores that indicate the level of similarity between images.

    However, interpreting these scores and understanding the specific areas of difference can be challenging for human auditors. While computer vision models excel at processing vast amounts of data quickly, translating their output into actionable insights can be a stumbling block. A numerical score may not immediately convey the nature or extent of the differences between images.

    In the assessment of the images shown here, all scores fall within the 70 to 80 range (out of a maximum of 100). However, discerning the nature of differences—whether they are apparent or subtle—poses a challenge for the AI models and human auditors. For example, there are differences in the placement or type of images on the packaging, as well as packaging text that is often in an extremely small font size. It is, of course, possible for human auditors to identify the differences in these images, but it’s a slow, error-prone, and tiring process, especially when auditors often have to check hundreds of image pairs each day.

    So how do we ensure that we identify differences in images accurately? The answer lies in the process of visual highlighting.

    How Visual Highlighting Works

    Visual highlighting is a method that enhances our ability to comprehend differences in images by combining sophisticated algorithms with human understanding. Instead of relying solely on numerical scores, this approach introduces a visual layer, resembling a heatmap, guiding human auditors to specific areas where discrepancies are present.

    Consider the scenario depicted in the images above: a computer vision model assigns a score of 70-85 for these images. While this score suggests relatively high similarity, it fails to uncover major differences between the images. Visual highlighting comes into play to overcome this limitation, precisely indicating regions where even subtle differences are seen.

    Visual highlighting entails overlaying compared images and emphasizing areas of difference, achieved through techniques like color coding, outlining, or shading specific regions. The significance of the difference in a particular area determines the intensity of the visual highlight.

    For instance, if there’s a change in the product’s color or a discrepancy in the packaging, these variations will be visually emphasized. This not only streamlines the auditing process but also enables human evaluators to make well-informed decisions quickly.
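    For intuition, here is a minimal pixel-level sketch of the highlighting idea: align two images, compute per-pixel differences, and paint the regions that deviate. Production systems highlight differences using model feature maps rather than raw pixels, and the file names here are hypothetical, so treat this purely as an illustration.

```python
from PIL import Image, ImageChops
import numpy as np

def difference_heatmap(original_path: str, listing_path: str, threshold: int = 30):
    """Minimal sketch of visual highlighting: resize two images to the same size,
    compute per-pixel differences, and paint regions that differ."""
    a = Image.open(original_path).convert("RGB")
    b = Image.open(listing_path).convert("RGB").resize(a.size)
    diff = np.asarray(ImageChops.difference(a, b)).sum(axis=2)  # per-pixel difference
    mask = diff > threshold  # True where the listing deviates from the brand image
    highlight = np.array(a).copy()
    highlight[mask] = [255, 0, 0]  # paint differing regions red for the auditor
    return Image.fromarray(highlight)

# Hypothetical usage:
# difference_heatmap("brand_original.jpg", "marketplace_listing.jpg").save("highlighted.png")
```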

    Benefits of Visual Highlighting

    • Intuitive Understanding: Visual highlighting offers an intuitive method for interpreting and acting upon the outcomes of computer vision models. Instead of delving into numerical scores, auditors can concentrate on the highlighted areas, enhancing the efficiency and accuracy of the decision-making process.
    • Accelerated Auditing: By bringing attention to specific regions of concern, visual highlighting speeds up the auditing process. Human evaluators can swiftly identify and address discrepancies without the need for exhaustive image analysis.
    • Seamless Communication: Visual highlighting promotes clearer communication between automated systems and human auditors. Serving as a visual guide, it enhances collaboration, ensuring that the subtleties captured by computer vision models are effectively conveyed.

    The Way Forward

    As technology continues to evolve, the integration of visual highlighting methodologies is likely to become more sophisticated. Artificial intelligence and machine learning algorithms may play an even more prominent role in not only detecting differences but also in refining the visual highlighting process.

    The collaboration between human auditors and AI ensures a comprehensive approach to maintaining brand integrity in the ever-expanding digital marketplace. By visually highlighting differences in images, brands can safeguard their visual identity, foster consumer trust, and deliver a consistent and reliable online shopping experience. In the intricate dance between technology and human intuition, visual highlighting emerges as a powerful tool, paving the way for brands to uphold their image with precision and efficiency.

    To learn more, reach out to us today!


    (This article was co-authored by Apurva Naik)

  • How DataWeave Enhances Transparency in Competitive Pricing Intelligence for Retailers

    How DataWeave Enhances Transparency in Competitive Pricing Intelligence for Retailers

    Retailers heavily depend on pricing intelligence solutions to consistently achieve and uphold their desired competitive pricing positions in the market. The effectiveness of these solutions, however, hinges on the quality of the underlying data, along with the coverage of product matches across websites.

    As a retailer, gaining complete confidence in your pricing intelligence system requires a focus on the trinity of data quality:

    • Accuracy: Accurate product matching ensures that the right set of competitor product(s) are correctly grouped together along with yours. It ensures that decisions taken by pricing managers to drive competitive pricing and the desired price image are based on reliable apples-to-apples product comparisons.
    • Freshness: Timely data is paramount in navigating the dynamic market landscape. Up-to-date SKU data from competitors enables retailers to promptly adjust pricing strategies in response to market shifts, competitor promotions, or changes in customer demand.
    • Product matching coverage: Comprehensive product matching coverage ensures that products are thoroughly matched with similar or identical competitor products. This involves accurately matching variations in size, weight, color, and other attributes. A higher coverage ensures that retailers seize all available opportunities for price improvement at any given time, directly impacting revenues and margins.

    However, the reality is that untimely data and incomplete product matches have been persistent challenges for pricing teams, compromising their pricing actions. Inaccurate or incomplete data can lead to suboptimal decisions, missed opportunities, and reduced competitiveness in the market.

    What’s worse than poor-quality data? Poor-quality data masquerading as accurate data.

    In many instances, retailers face a significant challenge in obtaining comprehensive visibility into crucial data quality parameters. If they suspect the data quality of their provider is not up to the mark, they are often compelled to manually request reports from their provider to investigate further. This lack of transparency not only hampers their pricing operations but also impedes the troubleshooting process and decision-making, slowing down crucial aspects of their business.

    We’ve heard about this problem from dozens of our retail customers for a while. Now, we’ve solved it.

    DataWeave’s Data Statistics and SKU Management Capability Enhances Data Transparency

    DataWeave’s Data Statistics Dashboard, offered as part of our Pricing Intelligence solution, enables pricing teams to gain unparalleled visibility into their product matches, SKU data freshness, and accuracy.

    It enables retailers to autonomously assess and manage SKU data quality and product matches independently—a crucial aspect of ensuring the best outcomes in the dynamic landscape of eCommerce.

    Beyond providing transparency and visibility into data quality and product matches, the dashboard facilitates proactive data quality management. Users can flag incorrect matches and address various data quality issues, ensuring a proactive approach to maintaining the highest standards.

    Retailers can benefit in several ways with this dashboard, as listed below.

    View Product Match Rates Across Websites

    The dashboard helps retailers track match rates to gauge their health. High product match rates signify that pricing teams can move forward in their pricing actions with confidence. Low match rates would be a cause for further investigation, to better understand the underlying challenges, perhaps within a specific category or competitor website.

    Our dashboard presents both summary statistics on matches and data crawls as well as detailed snapshots and trend charts, providing users with a holistic and detailed perspective of their product matches.

    Additionally, the dashboard provides category-wise snapshots of reference products and their matching counterparts across various retailers, allowing users to focus on areas with lower match rates, investigate underlying reasons, and develop strategies for speedy resolution.
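    Conceptually, a category-wise match rate is simply the share of reference SKUs with at least one competitor match, as in the small sketch below. The column names and data are illustrative assumptions, not the dashboard’s actual schema.

```python
import pandas as pd

# Illustrative sketch of the match-rate view: for each category, what share of
# reference SKUs has at least one competitor match?
skus = pd.DataFrame({
    "category": ["Beverages", "Beverages", "Snacks", "Snacks", "Snacks"],
    "sku_id":   ["B1", "B2", "S1", "S2", "S3"],
    "matched":  [True, True, True, False, False],  # has >= 1 competitor match
})

match_rates = skus.groupby("category")["matched"].mean().mul(100).round(1)
print(match_rates)
# Beverages    100.0
# Snacks        33.3
```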

    Track Data Freshness Easily

    The dashboard enables pricing teams to monitor the timeliness of pricing data and assess its recency. In the dynamic realm of eCommerce, having up-to-date data is essential for making impactful pricing decisions. The dashboard’s presentation of freshness rates ensures that pricing teams are armed with the latest product details and pricing information across competitors.

    Within the dashboard, users can readily observe the count of products updated with the most recent pricing data. This feature provides insights into any temporary data capture failures that may have led to a decrease in data freshness. Armed with this information, users can adapt their pricing decisions accordingly, taking into consideration these temporary gaps in fresh data. This proactive approach ensures that pricing strategies remain agile and responsive to fluctuations in data quality.

    Proactively Manage Product Matches

    The dashboard provides users with proactive control over managing product matches within their current bundles via the ‘Data Management’ panel. This functionality empowers users to verify, add, flag, or delete product matches, offering a hands-on approach to refining the matching process. Despite the deployment of robust matching algorithms that achieve industry-leading match rates, occasional instances may arise where specific matches are overlooked or misclassified. In such cases, users play a pivotal role in fine-tuning the matching process to ensure accuracy.

    The interface’s flexibility extends to accommodating product variants and enables users to manage product matches based on store location. Additionally, the platform facilitates bulk match uploads, streamlining the process for users to efficiently handle large volumes of matching data. This versatility ensures that users have the tools they need to navigate and customize the matching process according to the nuances of their specific product landscape.

    Gain Unparalleled Visibility into your Data Quality

    With DataWeave’s Pricing Intelligence, users gain the capability to delve deep into their product data, scrutinize match rates, assess data freshness, and independently manage their product matches. This approach is instrumental in fostering informed and effective decisions, optimizing inventory management, and securing a competitive edge in the dynamic world of online retail.

    To learn more, reach out to us today!

  • Why Unit of Measure Normalization is Critical For Accurate and Actionable Competitive Pricing Intelligence

    Why Unit of Measure Normalization is Critical For Accurate and Actionable Competitive Pricing Intelligence

    Competitive pricing intelligence is pivotal for retailers seeking to analyze their product pricing in relation to competitors. This practice is essential for ensuring that their product range maintains a competitive edge, meeting both customer expectations and market demands consistently.

    Product matching serves as a foundational element within any competitive pricing intelligence solution. Products are frequently presented in varying formats across different websites, featuring distinct titles, images, and descriptions. Undertaking this process at a significant scale is highly intricate due to numerous factors. One such complication arises from the fact that products are often displayed with differing units of measurement on various websites.

    The Challenge of Varying Units

    In certain product categories, retailers often offer the same item in varying volumes, quantities, or weights. For instance, a clothing item might be available as a single piece or in packs of 2 or 3, while grocery brands commonly sell eggs in counts of 6, 12, or 24.

    Consider this example: a quick glance might suggest that an 850g pack of Kellogg’s Corn Flakes priced at $5 is a better deal than a 980g pack of Nestle Cornflakes priced at $5.2. However, this assumption can be deceptive. In reality, the latter offers better value for your money, a fact that only becomes evident through price comparisons after standardizing the units.

    This issue is particularly relevant due to the prevalence of “shrinkflation,” where brands adjust packaging sizes or quantities to offset inflation while keeping prices seemingly low. When quantities, pack sizes, weight, etc. reduce instead of prices increasing, it’s important that this change is considered while analyzing competitive pricing.

    Normalizing Units of Measure

    In order to effectively compare prices among different competitors, retailers must standardize the diverse units of measurement they encounter. This standardization (or normalization) is crucial because price comparisons should extend beyond individual product SKUs to accommodate variations in package sizes and quantities. It’s essential to normalize units, ranging from “each” (ea) for individual items to “dozen” (dz) for sets, and from “pounds” (lb), “kilograms” (kg), “liters” (ltr), to “gallons” (gal) for various product types.

    For example, a predetermined base unit of measure, such as 100 grams for a specific product like cornflakes, serves as the reference point. The unit-normalized price for any cornflake product would then be the price per 100 grams. In the example provided, this reveals that Kellogg’s is priced at $0.59 per 100 grams, while Nestle is priced at $0.53 per 100 grams.
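    As a simple illustration (and not DataWeave's production logic), unit normalization boils down to converting each listing's pack size into a common base unit and dividing the price by it. The conversion table and listings below are illustrative:

    # Minimal sketch of unit-normalized pricing, assuming a base unit of 100 g.
    # The conversion factors and listings are illustrative, not DataWeave's data model.
    GRAMS_PER_UNIT = {"g": 1.0, "kg": 1000.0, "oz": 28.3495, "lb": 453.592}

    def price_per_100g(price, size, unit):
        """Return the price per 100 grams for a listing of `size` `unit`."""
        grams = size * GRAMS_PER_UNIT[unit]
        return round(price / grams * 100, 2)

    print(price_per_100g(5.00, 850, "g"))   # Kellogg's 850 g pack -> 0.59
    print(price_per_100g(5.20, 980, "g"))   # Nestle 980 g pack -> 0.53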

    Various Categories of Unit Normalization

    1. Weight Normalization

    Retailers frequently feature products with weight measurements expressed in grams (g), kilograms (kg), pounds (lbs), or ounces (oz).

    2. Quantity or Pack Size Normalization

    Products are also often featured with varying pack sizes or quantities in each SKU.

    3. Volume or Capacity Normalization

    Products can also vary in volumes or capacities with units like liters (L) or fluid ounces (fl oz).

    DataWeave’s Unit Normalized Pricing Intelligence Solution

    DataWeave’s highly sophisticated product matching engine can match the same or similar products and normalize their units of measurement, leading to highly accurate and actionable competitive pricing insights. It standardizes different units of measurement, like weight, quantity, and volume, ensuring fair comparisons across similar and exact matched products.

    Retailers have the flexibility to view pricing insights either with retailer units or normalized units. This capability empowers retailers and analysts to perform accurate, in-depth analyses of pricing information at a product level.

    In some scenarios, analyzing unit normalized pricing reflects pricing trends and competitiveness more accurately than retail price alone. This is particularly true for categories like CPG, where products are sold in diverse units of measure. For instance, in the example shown here, we can view a comparison of price position trends for the category of Fruits and Vegetables based on both retail price and unit price.

    The difference is striking: the original retail price based analysis shows a stagnation in price position, whereas unit normalized pricing analysis reflects a more dynamic pricing scenario.

    With DataWeave, retailers can specify which units to compare, ensuring that comparisons are made accurately. For example, a retailer can specify that unit price comparisons apply only to 8, 12, or 16-ounce packs, as well as 1 or 3-pound packs, but not to 10 and 25-pound bags. This precision ensures that products are matched correctly, and prices are represented for appropriately normalized units, leading to more accurate pricing insights.

    To learn more about this capability, write to us at contact@dataweave.com or visit our website today!

  • How Apache Airflow Optimizes Complex Workflows in DataWeave’s Technology Platform

    How Apache Airflow Optimizes Complex Workflows in DataWeave’s Technology Platform

    As successful businesses grow, they add a large number of people, customers, tools, technologies, etc., and roll out processes to manage the increasingly complex landscape. Automation ensures that these processes run in a smooth, efficient, swift, accurate, and cost-effective manner. To this end, Workflow Management Systems (WMS) help businesses roll out an automated platform that manages and optimizes business processes at large scale.

    While workflow management, in itself, is a fairly intricate undertaking, the eventual improvements in productivity and effectiveness far outweigh the effort and costs.

    At DataWeave, on a normal day, we collect, process, and generate business insights on terabytes of data for our retail and brand customers. Our core data pipeline ensures consistent data availability for all downstream processes, including our proprietary AI/ML layer. While the data pipeline itself is generic and serves standard workflows, there has been a steady surge in customer-specific use case complexities and the variety of product offerings over the last few years.

    A few months ago, we recognized the need for an orchestration engine. This engine would serve to manage the vast volumes of data received from customers, capture data from competitor websites (which vary in complexity and can number anywhere from 2 to 130+), run the required data transformations, execute the product matching algorithm through our AI systems, process the output through a layer of human verification, generate actionable business insights, feed the insights to reports and dashboards, and more. In addition, this engine would be required to help us manage the diverse customer use cases in a consistent way.

    As a result, we launched a hunt for a suitable WMS. We needed the system to satisfy several criteria:

    • Ability to manage our complex pipeline, which has several integrations and tech dependencies
    • Extendable system that enables us to operate with multiple types of databases, internal apps, utilities, and APIs
    • Plug and play interfaces to execute custom scripts, and QA options at each step
    • Operates with all cloud services
    • Addresses the needs of both ‘Batch’ and ‘Near Real Time’ processes
    • Generates meaningful feedback and stats at every step of the workflow
    • Helps us move away from numerous crontabs, which are hard to manage
    • Executes workflows repeatedly in a consistent and precise manner
    • Ability to combine multiple complex workflows and conditional branching of workflows
    • Provides integrations with our internal project tracking and messaging tools, such as Slack and Jira, for immediate visibility and escalations
    • A fallback mechanism at each step, in case of any subsystem failures
    • Fits within our existing landscape and doesn’t mandate significant alterations
    • Should support autoscaling since we have varying workloads (the system should scale the worker nodes on-demand)

    On evaluating several WMS providers, we zeroed in on Apache Airflow. Airflow satisfies most of our needs mentioned above, and we’ve already onboarded tens of enterprise customer workflows onto the platform.

    In the following sections, we will cover our Apache Airflow implementation and some of the best practices associated with it.

    DataWeave’s Implementation

    Components

    Broker: A 3-node RabbitMQ cluster for high availability. Two separate queues are maintained, one for SubDAGs and one for tasks, since SubDAGs are very lightweight processes: while they occupy a worker slot, they don’t do any meaningful work apart from waiting for their tasks to complete.

    Meta-DB: MetaDB is one of the most crucial components of Airflow. We use RDS-MySQL for the managed database.

    Controller: The controller consists of the scheduler, web server, file server, and the canary dag. This is hosted in a public subnet.

    Scheduler and Webserver: The scheduler and webserver are part of the standard Airflow services.

    File Server: Nginx is used as a file server to serve airflow logs and application logs.

    Canary DAG: The canary DAG mimics the actual load on our workers. It runs every 30 minutes and checks the health of the scheduler and the workers. If the task is not queued at all or has spent more time in the queued state than expected, then either the scheduler or the worker is not functioning as expected. This will trigger an alert.
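    As a rough illustration, a canary DAG can be as small as a single no-op task on a 30-minute schedule, with a failure callback that raises an alert; an external check can additionally flag tasks stuck in the queued state. The sketch below assumes Airflow 2.x import paths, and the alert callback is a placeholder, not our exact DAG:

    # Minimal canary DAG sketch: a trivial task scheduled every 30 minutes.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def heartbeat():
        # The task does nothing useful; what matters is that it gets scheduled,
        # queued, and executed on a worker within the expected time window.
        print("canary alive")

    def alert_on_failure(context):
        # Placeholder: in practice this would notify Slack / create a Jira ticket.
        print("canary failed for run %s" % context["run_id"])

    with DAG(
        dag_id="canary",
        start_date=datetime(2021, 1, 1),
        schedule_interval=timedelta(minutes=30),
        catchup=False,
        default_args={"on_failure_callback": alert_on_failure},
    ) as dag:
        PythonOperator(
            task_id="heartbeat",
            python_callable=heartbeat,
            execution_timeout=timedelta(minutes=10),  # surfaces stuck scheduler/workers
        )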

    Workers: The workers are placed in a private subnet. A general-purpose AWS machine with two types of workers is configured, one for sub-DAGs and one for tasks. The workers are placed in an EC2-Autoscaling group and the size of the group will either grow or shrink depending on the current tasks that are executed.

    Autoscaling of workers

    Increasing the group size: A Lambda function is triggered at a periodic interval and checks the length of the RabbitMQ queue. The Lambda also knows the current number of workers in the fleet. Along with that, we log the average run time of tasks in the DAG. Based on these parameters, we either increase or decrease the group size of the cluster.

    Reducing the group size: When we decrease the number of workers, it also means any of the workers can be taken down and the worker needs to be able to handle it. This is done through termination hooks. We follow an aggressive scale-up policy and a conservative scale-down policy.
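    A bare-bones version of such a scaling check might look like the sketch below. The queue name, Auto Scaling group name, hosts, and thresholds are made-up placeholders, and the real logic also factors in average task runtimes as described above:

    # Sketch of a periodic scaling check: read the RabbitMQ queue depth and adjust
    # the EC2 Auto Scaling group. All names and thresholds are illustrative.
    import boto3
    import pika

    ASG_NAME = "airflow-workers"     # assumed Auto Scaling group name
    TASK_QUEUE = "airflow.tasks"     # assumed RabbitMQ queue name
    TASKS_PER_WORKER = 8             # assumed worker slot count

    def handler(event, context):
        # A passive declare returns the current message count without altering the queue.
        conn = pika.BlockingConnection(pika.ConnectionParameters(host="rmq.internal"))
        depth = conn.channel().queue_declare(queue=TASK_QUEUE, passive=True).method.message_count
        conn.close()

        asg = boto3.client("autoscaling")
        group = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]
        current = group["DesiredCapacity"]
        desired = max(group["MinSize"], min(group["MaxSize"], -(-depth // TASKS_PER_WORKER)))

        if desired > current:                   # aggressive scale-up
            asg.set_desired_capacity(AutoScalingGroupName=ASG_NAME, DesiredCapacity=desired)
        elif desired < current and depth == 0:  # conservative scale-down
            asg.set_desired_capacity(AutoScalingGroupName=ASG_NAME, DesiredCapacity=current - 1)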

    File System: We use EFS (Elastic File System) of AWS as the file system that is shared between the workers and the controller. EFS is a managed NAS that can be mounted on multiple services. By using EFS, we have ensured that all the logs are present in one file system and these logs are accessible from the file server present in the controller. We have put in place a lifecycle policy on EFS to archive data older than 7 days.

    Interfaces: To scale up the computing platform when required, we have a bunch of hooks, libraries, and operators to interact with external systems like Slack, EMR, Jira, S3, Qubole, Athena, and DynamoDB. Standard interfaces like Jira and Slack also help in onboarding the L2 support team. The L2 support relies on Jira and Slack notifications to monitor the DAG progress.

    Deployment

    Deployment in an Airflow system is fairly challenging and involves multiple stages.

    Challenges:

    • If we first deploy the controller and if there are any changes in the DAG, the corresponding tasks may not be present in workers. This may lead to a failure.
    • We have to make blue-green deployments as we cannot deploy on the workers where tasks may still be running. Once the worker deployments are done, the controller deployment takes place. If it fails for any reason, both the deployments will be rolled back.

    We use AWS CodeDeploy to perform these activities.

    Staging and Development

    For development, we use a docker container based on the Puckel docker-airflow image. We have made certain modifications to change the user_id and also to run multiple docker containers on the same system. This helps us test all new functionality at the DAG level.

    The staging environment is exactly like the development environment, wherein we have isolated our entire setup in separate VPCs, IAM policies, S3-Buckets, Athena DBs, Meta-DBs, etc. This is done to ensure the staging environment doesn’t interfere with our production systems. The staging setup is also used to test the infra-level changes like autoscaling policy, SLAs, etc.

    In Summary

    Following the deployment of Apache Airflow, we have onboarded several enterprise customers across our product suite and seen up to a 4X improvement in productivity, consistency and efficiency. We have also built a sufficient set of common libraries, connectors, and validation rules over time, which takes care of most of our custom, customer-specific needs. This has enabled us to roll out our solutions much faster and with better ROI.
    As Airflow has been integrated with our communications and project tracking systems, we now have much faster and better visibility into current statuses and issues with sub-processes, along with duration-based automation for escalations.
    At the heart of all the benefits we’ve derived is the fact that we have now achieved much higher consistency in processing large volumes of diverse data, which is one of DataWeave’s key differentiators.
    In subsequent blog posts, we will dive deeper into specific areas of this architecture to provide more details. Stay tuned!

  • Flaunt Your Deep-Tech Prowess at Bootstrap Paradox Hackathon Hosted by Blume Ventures

    Flaunt Your Deep-Tech Prowess at Bootstrap Paradox Hackathon Hosted by Blume Ventures

    When DataWeave was founded in 2011, we set out to democratize data by enabling businesses to leverage public Web data to solve mission-critical business problems. Eight years on, we have done just that, and grown to deliver AI-powered competitive intelligence and digital shelf analytics to several global retailers and brands, which include the likes of Adidas, QVC, Overstock, Sauder, Dorel, and more.

    As the company has grown, so has our team, which is now 140+ members strong. We’re still constantly on the lookout for smart, open, and driven folks to join us and contribute to our success.

    And so, we’re excited to partner with Skillenza and Blume Ventures to co-host the Bootstrap Paradox Hackathon, where we are eager to engage with the developer community and contribute in our own way back to the startup ecosystem.

    The event will be conducted as an offline product building competition, with a duration of 24 hours on August 3-4, 2019 at the Microsoft India office in Bengaluru. It will provide a platform for developers and coders to interact with and solve challenges thrown up by DataWeave and other Blume portfolio companies, such as Dunzo, Unacademy, Milkbasket, Mechmocha, and Locus.

     

     

    Taking up DataWeave’s challenge during this Hackathon will give you a sneak peek into what our team works on daily. It’s no surprise that we have “At DataWeave, it’s a Hackathon every day!” plastered on our walls. After all, it’s not just all about intense work, but also a lot of fun and frolic.

    The problems that we deal with are as exciting as they are hard. Some of our key accomplishments in technology include:

    • Matching products across e-commerce websites at massive scale and at high levels of accuracy and coverage
    • Using Computer Vision to detect product attributes in fashion, such as color, sleeve length, collar type, etc., by analyzing catalog images
    • Aggregating data from complex web environments, including mobile apps, and across 25+ international languages

    One of our more recent innovations has been in optimizing e-commerce product discovery engines, which dramatically improves shopper experience and purchase conversion rates. During the Bootstrap Paradox Hackathon, coders will get a chance to build a similar engine, with guidance and assistance from DataWeave’s technology leaders.

    Data sets containing product information like title, description, image URL, price, category etc. will be provided, and coders will need to clean up the data, extract information on relevant product attributes and features, and index them, in the process of building the product discovery engine.

    For more details on the challenge, register here on the Skillenza platform.

    As a sweetener, the event also promises everyone a chance to win over 10 lakhs in prize money.

    Simply put, if you love code, this is the place to be this weekend. See you there!

  • Dataweave – CherryPy vs Sanic: Which Python API Framework is Faster?

    Dataweave – CherryPy vs Sanic: Which Python API Framework is Faster?

    REST APIs play a crucial role in the exchange of data between the internal systems of an enterprise, or when connecting with external services.

    When an organization relies on APIs to deliver a service to its clients, the APIs’ performance is crucial, and can make or break the success of the service. It is, therefore, essential to consider and choose an appropriate API framework during the design phase of development. Benefits of choosing the right API framework include the ability to deploy applications at scale, ensuring agility of performance, and future-proofing front-end technologies.

    At DataWeave, we provide Competitive Intelligence as a Service to retailers and consumer brands by aggregating Web data at scale and distilling them to produce actionable competitive insights. To this end, our proprietary data aggregation and analysis platform captures and compiles over a hundred million data points from the Web each day. Sure enough, our platform relies on APIs to deliver data and insights to our customers, as well as for communication between internal subsystems.

    Some Python REST API frameworks we use are:

    • Tornado — which supports asynchronous requests
    • CherryPy — which is multi-threaded
    • Flask-Gunicorn — which enables easy worker management

    It is essential to evaluate API frameworks depending on the demands of your tech platforms and your objectives. At DataWeave, we assess them based on their speed and their ability to support high concurrency. So far, we’ve been using CherryPy, a widely used framework, which has served us well.

    CherryPy

    An easy-to-use API framework, CherryPy does not require complex customizations, runs out of the box, and supports concurrency. At DataWeave, we rely on CherryPy to access configurations, serve data to and from different datastores, and deliver customized insights to our customers. So far, this framework has displayed very impressive performance.

    However, a couple of months ago, we were in the process of migrating to Python 3 (from Python 2), opening the door to a new API framework written exclusively for Python 3 — Sanic.

    Sanic

    Sanic is built on uvloop, an event loop based on libuv, and hence is a good contender for being fast.

    (libuv is an asynchronous I/O and event-loop library, and one of the reasons for its agility is its ability to handle asynchronous events through callbacks. More info on libuv can be found here.)

    In fact, Sanic is reported to be one of the fastest API frameworks in the world today, and uses the same event handler framework as nodejs, which is known to serve fast APIs. More information on Sanic can be found here.
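    For context, an ingestion endpoint of the kind we benchmarked looks roughly like the sketch below. The route and app name are hypothetical, and the Kafka producer call is elided:

    # Minimal Sanic sketch: an async POST endpoint that would hand the payload to Kafka.
    from sanic import Sanic
    from sanic.response import json

    app = Sanic("ingest-api")

    @app.route("/ingest", methods=["POST"])
    async def ingest(request):
        payload = request.json               # ~2.1 KB JSON body in our benchmark
        # await produce_to_kafka("topic-1", payload)  # elided in this sketch
        return json({"status": "queued"})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8000, access_log=False)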

    So we asked ourselves, should we move from CherryPy to Sanic?

    Before jumping on the hype bandwagon, we looked to first benchmark Sanic with CherryPy.

    CherryPy vs Sanic

    Objective

    Benchmark CherryPy and Sanic to process 500 concurrent requests, at a rate of 3500 requests per second.

    Test Setup

    Machine configuration: 4 VCPUs/ 8GB RAM.
    Network Cloud: GCE
    Number of Cherrypy/Sanic APIs: 3 (inserting data into 3 topics of a Kafka cluster)
    Testing tool: ApacheBench (ab)
    Payload size: All requests are POST requests with 2.1KB of payload.

    API Details

    Sanic: In Async mode
    Cherrypy: 10 concurrent threads in each API — a total of 30 concurrent threads
    Concurrency: Tested APIs at various concurrency levels. The concurrency varied between 10 and 500
    Number of requests: 100,000

    Results

    Requests Completion: A lower mean and a lower spread indicate better performance

     

    Observation

    When the concurrency is as low as 10, there is not much difference between the performance of the two API frameworks. However, as the concurrency increases, Sanic’s performance becomes more predictable, and the API framework functions with lower response times.

    Requests / Second: Higher values indicate faster performance

    Sanic clearly achieves higher requests/second because:

    • Sanic is running in Async mode
    • The mean response time for Sanic is much lower, compared to CherryPy

    Failures: Lower values indicate better reliability

    The number of non-2xx responses increased for CherryPy as concurrency increased. In contrast, the number of failed requests with Sanic remained below 10, even at high concurrency values.

    Conclusion

    Sanic clearly outperformed CherryPy, and was much faster, while supporting higher concurrency and requests per second, and displaying significantly lower failure rates.

    Following these results, we transitioned to Sanic for ingesting high volume data into our datastores, and started seeing much faster and more reliable performance. We now aggregate much larger volumes of data from the Web, at faster rates.

    Of course, as mentioned earlier in the article, it is important to evaluate your API framework based on the nuances of your setup and its relevant objectives. In our setup, Sanic definitely seems to perform better than CherryPy.

    What do you think? Let me know your thoughts in the comments section below.

    If you’re curious to know more about DataWeave’s technology platform, check out our website, and if you wish to join our team, check out our jobs page!

     

  • Implement a Machine-Learning Product Classification System

    Implement a Machine-Learning Product Classification System

    For online retailers, price competitiveness and a broad assortment of products are key to acquiring new customers, and driving customer retention. To achieve these, they need timely, in-depth information on the pricing and product assortment of competing retailers. However, in the dynamic world of online retail, price changes occur frequently, and products are constantly added, removed, and running out of stock, which impede easy access to harnessing competitive information.

    At DataWeave, we address this challenge by providing retailers with competitive pricing and assortment intelligence, i.e. information on their pricing and assortment, in comparison to their competition’s.

    The Need for Product Classification

    On acquiring online product and pricing data across websites using our proprietary data acquisition platform, we are tasked with representing this information in an easily consumable form. For example, retailers need product and pricing information along multiple dimensions, such as — the product categories, types, etc. in which they are the most and least price competitive, or the strengths and weaknesses of their assortment for each category, product type, etc.

    Therefore, there is a need to classify the products in our database in an automated manner. However, this process can be quite complex, since in online retail, every website has its own hierarchy of classifying products. For example, while “Electronics” may be a top-level category on one website, another may have “Home Electronics”, “Computers and Accessories”, etc. as top-level categories. Some websites may even have overlaps between categories, such as “Kitchen and Furniture” and “Kitchen and Home Appliances”.

    Addressing this lack of standardization in online retail categories is one of the fundamental building blocks of delivering information that is easily consumable and actionable.

    We, therefore, built a machine-learning product classification system that can predict a normalized category name for a product, given an unstructured textual representation. For example:

    • Input: “Men’s Wool Blend Sweater Charcoal Twist and Navy and Cream Small”
    • Output: “Clothing”
    • Input: “Nisi 58 mm Ultra Violet UV Filter”
    • Output: “Cameras and Accessories”

    To classify products, we first created a set of categories that was inclusive of variations in product titles found across different websites. Then, we moved on to building a classifier based on supervised learning.

    What is Supervised Learning?

    Supervised learning is a type of machine learning in which we “train” a product classification system by providing it with labelled data. To classify products, we can use product information, along with the associated category as label, to train a machine learning model. This model “learns” how to classify new, but similar products into the categories we train it with.

    To understand how product information can be used to train the model, we identified which data points about products we can use, and the challenges associated with using them.

    For example, this is what a product’s record looks like in our database:

    {
        "title": "Apple MacBook Pro Retina Display 13.3\" 128 GB SSD 8 GB RAM",
        "website": "Amazon",
        "meta": "Electronics > Computer and Accessories > Laptops > Macbooks",
        "price": "83000"
    }

    Here, “title” is unstructured text for a product. The hierarchical classification of the product on the given website is shown by “meta”.

    This product’s “title” can be represented in a structured format as:

    {
        "Brand": "Apple",
        "Screen Size": "13.3 inches",
        "Screen Type": "Retina Display",
        "RAM": "8 GB",
        "Storage": "128 GB SSD"
    }

    In this structured object, “Brand”, “Screen Size”, “Screen Type” and so on are referred to as “attributes”. Their associated items are referred to as “values”.

    Challenges of Working with Text

    Lack of uniformity in product titles across websites –

    In the example shown above, the given structured object is only one way of structuring the given unstructured text (title). The product title would likely change for every website it’s represented on. What’s worse, some websites lack any form of structured representation. Also, attributes and values may have different representations on different websites — ‘RAM’ may be referred to as ‘Memory’.

    Absence of complete product information –

    Not all websites provide complete product information in the title. Even when structured information is provided, the level of detail may vary across websites.

    Since these challenges are substantial, we chose to use unstructured titles of products as training inputs for supervised learning.

    Pre-processing and Vectorisation of Training Data

    Pre-processing of titles can be done as follows:

    • Lowercasing
    • Removing special characters
    • Removing stop words (like ‘and’, ‘by’, ‘for’, etc.)
    • Generating unigram and bigram tokens

    We represented the title as a vector using the Bag of Words model, with unigram and bigram tokens.

    The Algorithm

    We used Support Vector Machine (SVM) and compared the results with Naive Bayes Classifiers, Decision Trees and Random Forest.
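    A minimal sketch of this setup using scikit-learn is shown below. It is illustrative only (not our production model), and the sample titles and labels are made up:

    # Sketch: bag-of-words (unigrams + bigrams) vectorization feeding a linear SVM.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    titles = [
        "men's wool blend sweater charcoal twist and navy and cream small",
        "nisi 58 mm ultra violet uv filter",
        "apple macbook pro retina display 13.3 128 gb ssd 8 gb ram",
    ]
    labels = ["Clothing", "Cameras and Accessories", "Computers and Accessories"]

    model = Pipeline([
        # lowercasing, stop-word removal, and n-gram generation happen inside the vectorizer
        ("bow", CountVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2))),
        ("svm", LinearSVC()),
    ])
    model.fit(titles, labels)

    print(model.predict(["nisi uv filter 58 mm"]))
    print(model.decision_function(["washing machine top load"]))  # per-category decision values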

    Training Data Generation

    The total number of products we’ve acquired data for runs into the hundreds of millions, and every category has a different number of products. For example, we may have 40 million products in the “Clothing” category but only 2 million products in the “Sports and Fitness” category. We used a stratified sampling technique to ensure that we got a subset of the data that captures the maximum variation in the entire data.

    For each category, we included data from most websites that contained products of that category. Within each website, we included data from all subcategories and product types. The size of the data-set we used is about 10 million, sourced from 40 websites. We then divided our labelled data-set into two parts: training data-set and testing data-set.
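    For instance, a stratified hold-out split that preserves each category's share of the data can be produced with scikit-learn; the tiny lists here are purely illustrative:

    # Sketch of a stratified train/test split over parallel lists of titles and labels.
    from sklearn.model_selection import train_test_split

    titles = ["product title %d" % i for i in range(10)]       # illustrative titles
    categories = ["Clothing"] * 6 + ["Electronics"] * 4        # illustrative labels

    train_titles, test_titles, train_labels, test_labels = train_test_split(
        titles, categories, test_size=0.2, stratify=categories, random_state=42
    )
    print(len(train_titles), len(test_titles))  # 8 2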

    Evaluating the Model

    After training with the training dataset, we tested this machine-learning classification system using the testing dataset to find the accuracy of the model.

    Clearly, SVM generated the best accuracy compared to the other classifiers.

    Performance Statistics

    • System Specifications: 8-Core system (Intel(R) Xeon(R) CPU E3–1231 v3 @ 3.40GHz) with 32 GB RAM
    • Training Time: 90 minutes (approximately)
    • Prediction Time: Approximately 6 minutes to classify 1 million product titles. This is equivalent to about 3000 titles per second.

    Example Inputs and Outputs from the SVM Model (with Decision Values)

    • Input: “Washing Machine Top Load”

    Output: {“Home Appliances”: 1.45, “Home and Living”: 0.60, “Tools and Hardware”: 0.54}

    • Input: “Nisi 58 mm Ultra Violet UV Filter”

    Output: {“Cameras and Accessories”: 1.46, “Eyewear”: 1.14, “Home and Living”: 1.12}

    • Input: “NETGEAR AirCard AC778AT Around Town Mobile Internet — Mobile hot”

    Output: {“Computers and Accessories”: 0.82, “Books”: 0.61, “Toys”: 0.27}

    • Input: “Nike Sports Tee”

    Output: {“Sports and Fitness”: 1.63, “Footwear”: 0.63, “Toys and Baby Products”: 0.59}

    Largely, most of the outputs were accurate, which is no mean feat. Some incorrect outputs were those of fairly similar categories. For example, “Home and Living” was predicted for products that should have ideally been part of “Home Appliances”. Other incorrect predictions occurred when the input was ambiguous.

    There were also scenarios where the output decision values of the top two categories were quite close (as shown in the third example above), especially when the input was vague. In the last example above, the product should have been classified as “Clothing”, but got classified as “Sports and Fitness” instead, which is not entirely incorrect.

    Delivering Value with Competitive Intelligence

    The category classifier elucidated in this article is only the first element of a universal product organization system that we’ve built at DataWeave. The output of our category classification system is used by other in-house machine-learning and heuristic-based systems to generate more detailed product categories, types, subcategories, attributes, and the like.

    Our universal product organization system is the backbone of the Competitive Pricing and Assortment Intelligence solutions we provide to online retailers, which enable them to evaluate their pricing and assortment against competitors along multiple dimensions, helping them compete effectively in the cutthroat eCommerce space.

    Click here to find out more about DataWeave’s solutions and how modern retailers harness the power of data to drive revenue and margins.

  • Dataweave – Smartphones vs Tablets: Does size matter?

    Dataweave – Smartphones vs Tablets: Does size matter?

    Smartphones vs Tablets: Does size matter?

    We have seen a steady increase in the number of smartphones and tablets over the last five years. Looking at the number of smartphones, tablets, and now wearables (smart watches and fitness bands) being launched in the mobile market, we can truly call this ‘The Mobile Age’.

    We, at DataWeave, deal with millions of data points related to products which vary from electronics to apparel. One of the main challenges we encounter while dealing with this data is the amount of noise and variation present for the same products across different stores.

    One particular problem we have been facing recently is detecting whether a particular product is a mobile phone (smartphone) or a tablet. If it is mentioned explicitly somewhere in the product information or metadata, we can sit back and let our backend engines do the necessary work of classification and clustering. Unfortunately, with the data we extract and aggregate from the Web, the chances of finding this ontological information are quite slim.

    To address the above problem, we decided to take two approaches.

    • Try to extract this information from the product metadata
    • Try to get a list of smartphones and tablets from well known sites and use this information to augment the training of our backend engine

    Here we will talk mainly about the second approach since it is more challenging and engaging than the former. To start with, we needed some data specific to phone models, brands, sizes, dimensions, resolutions and everything else related to the device specifications. For this, we relied on a popular mobiles/tablets product information aggregation site. We crawled, extracted and aggregated this information and stored it as a JSON dump. Each device is represented as a JSON document like the sample shown below.

    { "Body": { "Dimensions": "200 x 114 x 8.7 mm", "Weight": "290 g (Wi-Fi), 299 g (LTE)" }, "Sound": { "3.5mm jack ": "Yes", "Alert types": "N/A", "Loudspeaker ": "Yes, with stereo speakers" }, "Tests": { "Audio quality": "Noise -92.2dB / Crosstalk -92.3dB" }, "Features": { "Java": "No", "OS": "Android OS, v4.3 (Jelly Bean), upgradable to v4.4.2 (KitKat)", "Chipset": "Qualcomm Snapdragon S4Pro", "Colors": "Black", "Radio": "No", "GPU": "Adreno 320", "Messaging": "Email, Push Email, IM, RSS", "Sensors": "Accelerometer, gyro, proximity, compass", "Browser": "HTML5", "Features_extra detail": "- Wireless charging- Google Wallet- SNS integration- MP4/H.264 player- MP3/WAV/eAAC+/WMA player- Organizer- Image/video editor- Document viewer- Google Search, Maps, Gmail,YouTube, Calendar, Google Talk, Picasa- Voice memo- Predictive text input (Swype)", "CPU": "Quad-core 1.5 GHz Krait", "GPS": "Yes, with A-GPS support" }, "title": "Google Nexus 7 (2013)", "brand": "Asus", "General": { "Status": "Available. Released 2013, July", "2G Network": "GSM 850 / 900 / 1800 / 1900 - all versions", "3G Network": "HSDPA 850 / 900 / 1700 / 1900 / 2100 ", "4G Network": "LTE 800 / 850 / 1700 / 1800 / 1900 / 2100 / 2600 ", "Announced": "2013, July", "General_extra detail": "LTE 700 / 750 / 850 / 1700 / 1800 / 1900 / 2100", "SIM": "Micro-SIM" }, "Battery": { "Talk time": "Up to 9 h (multimedia)", "Battery_extra detail": "Non-removable Li-Ion 3950 mAh battery" }, "Camera": { "Video": "Yes, 1080p@30fps", "Primary": "5 MP, 2592 x 1944 pixels, autofocus", "Features": "Geo-tagging, touch focus, face detection", "Secondary": "Yes, 1.2 MP" }, "Memory": { "Internal": "16/32 GB, 2 GB RAM", "Card slot": "No" }, "Data": { "GPRS": "Yes", "NFC": "Yes", "USB": "Yes, microUSB (SlimPort) v2.0", "Bluetooth": "Yes, v4.0 with A2DP, LE", "EDGE": "Yes", "WLAN": "Wi-Fi 802.11 a/b/g/n, dual-band", "Speed": "HSPA+, LTE" }, "Display": { "Multitouch": "Yes, up to 10 fingers", "Protection": "Corning Gorilla Glass", "Type": "LED-backlit IPS LCD capacitive touchscreen, 16M colors", "Size": "1200 x 1920 pixels, 7.0 inches (~323 ppi pixel density)" } }

    From the above document, it is clear that there are a lot of attributes that can be assigned to a mobile device. However, we would not need all of them for building our simple algorithm for labeling smartphones and tablets. I had initially decided to use the device screen size to separate smartphones from tablets, but took some suggestions from the team. After sitting down and taking a long, hard look at our dataset, Mandar suggested also using the device dimensions to achieve the same goal!

    Finally, the attributes that we decided to use were,

    • Size
    • Title
    • Brand
    • Device dimensions

    Screen size

    I wrote some regular expressions for extracting the features related to the device screen size and resolution. Getting the resolution was easy, achieved with the following Python code snippet. There were a couple of NA values, but we didn’t go out of our way to fetch the data from the web because resolution varies a lot and is not a key attribute for determining whether a device is a phone or a tablet.

    size_str = repr(doc["Display"]["Size"])
    resolution_pattern = re.compile(r'(?:\S+\s)x\s(?:\S+\s)\s?pixels')
    if resolution_pattern.findall(size_str):
        resolution = ''.join([token.replace("'", "") for token in resolution_pattern.findall(size_str)[0].split()[0:3]])
    else:
        resolution = 'NA'

    But the real problems started when I wrote regular expressions for extracting the screen size. I started off with analyzing the dataset and it seemed that screen size was mentioned in inches so I wrote the following regular expression for getting screen size.

    size_str = repr(doc["Display"]["Size"])
    screen_size_pattern = re.compile(r'(?:\S+\s)\s?inches')
    if screen_size_pattern.findall(size_str):
        screen_size = screen_size_pattern.findall(size_str)[0].split()[0]
    else:
        screen_size = 'NA'

    However, I noticed that I was getting a lot of ‘NA’ values for many devices. On looking up the same devices online, I noticed there were three distinct patterns with regard to screen size. They are:

    • Screen size in ‘inches’
    • Screen size in ‘lines’
    • Screen size in ‘chars’ or ‘characters’

    Now, some of you might be wondering what on earth ‘lines’ and ‘chars’ mean and how they measure screen size. On digging it up, I found that basically both of them mean the same thing but in different formats. If we have ‘n lines’ as the screen size, it means the screen can display at most ‘n’ lines of text at any instance of time. Likewise, if we have ‘n x m chars’ as the screen size, it means the device can display ‘n’ lines of text at any instance of time, with each line having a maximum of ‘m’ characters. The picture below will make things more clear. It represents a screen of 4 lines or 4 x 20 chars.

    Thus, the earlier logic for extracting screen size had to be modified and we used the following code snippet. We had to take care of multiple cases in our regexes, because the data did not have a consistent format.

    Thus, the earlier logic for extracting screen size had to be modified and we used the following code snippet. We had to take care of multiple cases in our regexes, because the data did not have a consistent format.

    size_str = repr(doc["Display"]["Size"])
    screen_size_pattern = re.compile(r'(?:\S+\s)\s?inc[h|hes]')
    if screen_size_pattern.findall(size_str):
        screen_size = screen_size_pattern.findall(size_str)[0].replace("'", "").split()[0] + ' inches'
    else:
        screen_size_pattern = re.compile(r'(?:\S+\s)\s?lines')
        if screen_size_pattern.findall(size_str):
            screen_size = screen_size_pattern.findall(size_str)[0].replace("'", "").split()[0] + ' lines'
        else:
            screen_size_pattern = re.compile(r'(?:\S+\s)x\s(?:\S+\s)\s?char[s|acters]')
            if screen_size_pattern.findall(size_str):
                screen_size = screen_size_pattern.findall(size_str)[0].replace("'", "").split()[0] + ' lines'
            else:
                screen_size = 'NA'

    Mandar helped me out with extracting the ‘dimensions’ attribute from the dataset and performing some transformations on it to get the total volume of the phone. It was achieved using the following code snippet.

    dimensions = doc['Body']['Dimensions']
    dimensions = re.sub(r'[^\s*\w*.-]', '', dimensions.split('(')[0].split(',')[0].split('mm')[0]).strip('-').strip('x')
    if not dimensions:
        dimensions = 'NA'
        total_area = 'NA'
    else:
        if 'cc' in dimensions:
            total_area = dimensions.split('cc')[0]
        else:
            total_area = reduce(operator.mul, [float(float(elem.split('-')[0]) / 10) for elem in dimensions.split('x')], 1)
        total_area = round(float(total_area), 3)

    We used PrettyTable to output the results in a clear and concise format.

    Next, we stored the above data in a csv file and used Pandas, Matplotlib, Seaborn, and IPython to do some quick exploratory data analysis and visualizations. The following depicts the top ten brands with the highest number of mobile devices as per the dataset.

    Then, we looked at the device area frequency for each brand using boxplots as depicted below. Based on the plot, it is quite evident that almost all the plots are right skewed, with a majority of the distribution of device dimensions (total area) falling in the range [0,150]. There are some notable exceptions like ‘Apple’ where the skew is considerably less than the general trend. On slicing the data for the brand ‘Apple’, we noticed that this was because devices from ‘Apple’ have an almost equal distribution based on the number of smartphones and tablets, leading to the distribution being almost normal.

    Based on similar experiments, we noticed that tablets had larger dimensions as compared to mobile phones, and screen sizes followed that same trend. We made some quick plots with respect to the device areas as shown below.

    Now, take a look at the above plots again. The second plot shows the distribution of device areas in a kernel density plot. This distribution resembles a Gaussian distribution but with a right skew. [Mandar reckons that it actually resembles a Logistic distribution, but who’s splitting hairs, eh? ;)] The histogram plot depicts the same, except here we see the frequency of devices vs the device areas. Looking at it closely, Mandar said that the bell shaped curve had the maximum number of devices and those must be all the smartphones, while the long thin tail on the right side must indicate tablets. So we set a cutoff of 160 cubic centimeters for distinguishing between phones and tablets.

    We also decided to calculate the correlation between ‘Total Area’ and ‘Screen Size’ because, as one might guess, devices with a larger area have larger screen sizes. So we transformed the screen sizes from textual to numeric format based on some processing, and calculated the correlation between them, which came out to be around 0.73, or 73%.
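    With pandas, that boils down to something like the snippet below; the column values are illustrative stand-ins for our processed dataset:

    # Sketch: Pearson correlation between numeric screen size and device area.
    import pandas as pd

    df = pd.DataFrame({
        "screen_size_inches": [4.0, 4.5, 5.0, 7.0, 9.7, 10.1],
        "total_area_cc": [55.0, 62.0, 70.0, 160.0, 240.0, 255.0],
    })
    print(df["screen_size_inches"].corr(df["total_area_cc"]))  # defaults to Pearson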

    We did get a high correlation between Screen Size and Device Area. However, I still wanted to investigate why we didn’t get a score close to 90%. On doing some data digging, I noticed an interesting pattern.

    After looking at the above results, what came to our minds immediately was: why do phones with such small screen sizes have such big dimensions? We soon realized that these devices were either “feature phones” of yore or smartphones with a physical keypad!

    Thus, we used screen sizes in conjunction with dimensions for labeling our devices. After a long discussion, we decided to use the following logic for labeling smartphones and tablets.

    device_class = None
    if total_area >= 160.0:
        device_class = 'Tablet'
    elif total_area < 160.0:
        device_class = 'Phone'
    if 'lines' in screen_size:
        device_class = 'Phone'
    elif 'inches' in screen_size:
        if float(screen_size.split()[0]) < 6.0:
            device_class = 'Phone'

    After all this fun and frolic with data analysis, we were able to label handheld devices correctly, just like we wanted!

    Originally published at blog.priceweave.com.

  • Why is Product Matching Difficult? | DataWeave

    Why is Product Matching Difficult? | DataWeave

    Product Matching is a combination of algorithmic and manual techniques to recognize and match identical products from different sources. Product matching is at the core of competitive intelligence for retail. A competitive intelligence product is most useful when it can accurately match products of a wide range of categories in a timely manner, and at scale.

    Shown below is PriceWeave’s Products Tracking Interface, one of the features where product matching is in action. The Products Tracking Interface lets a brand or a retailer track their products and monitor prices, availability, offers, discounts, variants, and SLAs on a daily (or more frequent) basis.

     

    A snapshot of products tracked for a large online mass merchant

     

    Expanded view for a product shows the prices related data points from competing stores

    Product Matching helps a retailer or a brand in several ways:

    • Tracking competitor prices and stock availability
    • Organizing seller listings on a marketplace platform
    • Discovering gaps in product catalog
    • Filling the missing attributes in product catalog information
    • Comparing product life cycles across competitors

    Given its criticality, every competitive intelligence product strives hard to make its product matching accurate and comprehensive. It is a hard problem, and one that cannot be completely addressed in an automated fashion. In the rest of this post, we will talk about why product matching is hard.

    Product Matching Guidelines

    Amazon provides guidelines to sellers on how they should write product catalog information in order to achieve good product matching for their seller listings. These guidelines apply to any retail store or marketplace platform. The trouble is, more often than not these guidelines are not followed, or cannot be followed by retailers because they don’t have access to all the product-related information. Some of the challenges are:

    • Products either don’t have a UPC code or it is not available. There are also non-standard products, unbranded products, and private label products.
    • There are products with slight variations in technical specifications, but the complete specs are not available.
    • Retailers manage a huge catalog of accessories, for instance Electronics Accessories (screen guards, flip covers, fancy USB drives, etc.).
    • Apparel and lifestyle products often have very little by way of unique identifiers. There is no standard nomenclature for colors, materials, and styles.
    • Products are often bundled with accessories or other related products. There are no standard ways of doing product bundling.

    In the absence of standard ways of representing products, every retailer uses their own internal product IDs, product descriptions, and attribute names.

    Algorithmic Product Matching using “Document Clustering”

    Algorithmic product matching is done using Machine Learning, typically techniques from Document Clustering. A document is a text document or a web page, or a set of terms that usually occur within a “context”. Document clustering is the process of bringing together (forming clusters of) similar documents, and separating out dissimilar ones. There are many ways of defining similarity of documents that we will not delve into in this post. Documents have “features” that act as “identifiers” that help an algorithm cluster them.

    A document in our case is a product description — essentially a set of data points or attributes we have extracted from a product page. These attributes include: title, brand, category, price, and other specs. Therefore, these are the attributes that help us cluster together similar products and match products. The quality of clustering — that is, how accurate and how complete the clusters are — depends on how good the features are. In our case, most of the time the features are not good, and that is what makes clustering, and in turn product matching, a hard problem.
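    As a toy illustration of the idea (and not PriceWeave's actual matcher), titles can be vectorized and compared pairwise; listings of the same product should score closer together than unrelated ones. The titles below are illustrative:

    # Toy sketch: TF-IDF vectors over product titles and pairwise cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    titles = [
        "Sony Cybershot DSC-WX650 digital camera",
        "Sony Cyber-shot DSCWX650 point and shoot camera",
        "Samsung Galaxy Note 2",
    ]
    vectors = TfidfVectorizer(lowercase=True, ngram_range=(1, 2)).fit_transform(titles)
    print(cosine_similarity(vectors).round(2))  # the two camera listings score closest together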

    Noisy Small Factually Weak (NSFW) Documents

    The documents that we deal with, the product descriptions, are not well formed and so not readily usable for product matching. We at PriceWeave characterize them endearingly as Noisy, Small, Factually Weak (NSFW) documents. Let us see some examples to understand these terms.

    Noisy

    • Spelling errors, non-standard and/or incomplete representations of product features.
    • Brands written as “UCB” and “WD” instead of “United Colors of Benetton” and “Western Digital”.
    • Model numbers might or might not be present. A camera’s model number may be written as one of the following variants: DSC-WX650 vs DSCWX650 vs DSC WX 650 vs WX 650.
    • Noisy/meaningless terms might be present (“brand new”, “manufacturer’s warranty”, “with purchase receipt”)

    Small

    • Not much description. A product simply written as “Apple iPhone” without any mention of its generation, or other features.
    • Not many distinguishable features. Example, “Samsung Galaxy Note vs Samsung Galaxy Note 2”, “Apple ipad 3 16 GB wifi+cellular vs Apple ipad mini 16 GB wifi-cellular”

    Factually Weak

    • Products represented with generic and subjective descriptions.
    • Colours and their combinations might be represented differently. Examples, “Puma Red Striped Bag”, “Adidas Black/Red/Blue Polo Tshirt”.

    In the absence of clean, sufficient, and specific product information, the quality of algorithmic matching suffers. Product matching includes many knobs and switches to adjust the weights given to different product attributes. For example, we might include a rule that says, “if two products are identical, then they fall in the same price range.” While such rules work well generally, they vary widely from category to category and across geographies. Further, adding more and more specific rules will start throwing off the algorithms in unexpected ways, rendering them less effective.

    In this post, we discussed the challenges posed by product matching that make it a hard problem to crack. In the next post, we will discuss how we address these challenges to make PriceWeave’s product matching robust.

    PriceWeave is an all-around Competitive Intelligence product for retailers, brands, and manufacturers. We’re built on top of huge amounts of product data to provide real-time actionable insights. PriceWeave’s offerings include: pricing intelligence, assortment intelligence, gaps in catalogs, and promotion analysis. Please visit PriceWeave to view all our offerings. If you’d like to try us out, request a demo.

    Originally published at blog.priceweave.com.

  • A Peek into GNU Parallel

    A Peek into GNU Parallel

    GNU Parallel is a tool that can be deployed from a shell to parallelize job execution. A job can be anything from simple shell scripts to complex, interdependent Python/Ruby/Perl scripts. The simplicity of the ‘Parallel’ tool lies in its usage. A modern day computer with multicore processors should be enough to run your jobs in parallel. A single core computer can also run the tool, but the user won’t be able to see any difference, as the jobs will be context-switched by the underlying OS.

    At DataWeave, we use Parallel for automating and parallelizing a number of resource-intensive processes, ranging from crawling to data extraction. All our servers have 8 cores, each capable of executing 4 threads. So, we experienced a huge performance gain after deploying Parallel. Our in-house image processing algorithms used to take more than a day to process 200,000 high resolution images. After using Parallel, we have brought the time down to a little over 40 minutes!

    GNU Parallel can be installed on any Linux box and does not require sudo access. The following command will install the tool:

    (wget -O - pi.dk/3 || curl pi.dk/3/) | bash

    GNU Parallel can read inputs from a number of sources — a file or command line or stdin. The following simple example takes the input from the command line and executes in parallel:

    parallel echo ::: A B C

    The following takes the input from a file:

    parallel -a somefile.txt echo

    Or from STDIN:

    cat somefile.txt | parallel echo

    The inputs can be from multiple files too:

    parallel -a somefile.txt -a anotherfile.txt echo

    The number of simultaneous jobs can be controlled using the --jobs or -j switch. The following command will run 5 jobs at once:

    parallel --jobs 5 echo ::: A B C D E F G H I J

    By default, the number of jobs will be equal to the number of CPU cores. However, this can be overridden using percentages. The following will run 2 jobs per CPU core:

    parallel --jobs 200% echo ::: A B C D

    If you do not want to set any limit, then the following will use all the available CPU cores in the machine. However, this is NOT recommended in a production environment, as other jobs running on the machine will be vastly slowed down.

    parallel --jobs 0 echo ::: A B C

    Enough with the toy examples. The following will show you how to bulk insert JSON documents in parallel into a MongoDB cluster. Almost always, we need to insert millions of documents quickly into our MongoDB cluster, and inserting documents serially doesn’t cut it. Moreover, MongoDB can handle parallel inserts.

    The following is a snippet of a file with JSON documents. Let’s assume that there are a million similar records in the file, with one JSON document per line.

    {"name": "John", "city": "Boston", "age": 23}
    {"name": "Alice", "city": "Seattle", "age": 31}
    {"name": "Patrick", "city": "LA", "age": 27}
    ...

    The following Python script will take each JSON document and insert it into the “people” collection under the “dw” database.

    import json
    import pymongo
    import sys

    document = json.loads(sys.argv[1])

    client = pymongo.MongoClient()
    db = client["dw"]
    collection = db["people"]

    try:
        collection.insert(document)
    except Exception as e:
        print "Could not insert document in db", repr(e)

    Now, to run this in parallel, the following command should do the magic:

    cat people.json | parallel 'python insertDB.py {}'

    That’s it! There are many switches and options available for advanced processing. They can be accessed by doing a man parallel on the shell. Also the following page has a set of tutorials: GNU Parallel Tutorials.

  • How to Conquer Data Mountains API by API | DataWeave

    How to Conquer Data Mountains API by API | DataWeave

    Let’s revisit our raison d’être: DataWeave is a platform on which we do large-scale data aggregation and serve this data in forms that are easily consumable. The nature of the data that we deal with is that: (1) it is publicly available on the web, (2) it is factual (to the extent possible), and (3) it has high utility (decided by a number of factors that we discuss below).

    The primary access channel for our data is the Data API. Other access channels, such as visualizations, reports, dashboards, and alerting systems, are built on top of our data APIs. Data Products, such as PriceWeave, are built by combining multiple APIs and packaging them with reporting and analytics modules.

    Even as the platform is capable of aggregating any kind of data on the web, we need to prioritize the data that we aggregate, and the data products that we build. There are a lot of factors that help us in deciding what kinds of data we must aggregate and the APIs we must provide on DataWeave. Some of these factors are:

    1. Business Case: A strong business use-case for the API. There has to be an inherent pain point the data set must solve. Be it the Telecom Tariffs API or the Price Intelligence API — there are a bunch of pain points they solve for distinct customer segments.
    2. Scale of Impact: There has to exist a large enough volume of potential consumers that are going through the pain points, that this data API would solve. Consider the volume of the target consumers for the Commodity Prices API, for instance.
    3. Sustained Data Need: Data that a consumer needs frequently and/or on a long term basis has greater utility than data that is needed infrequently. We look at weather and prices all the time. Census figures, not so much.
    4. Assured Data Quality: Our consumers need to be able to trust the data we serve: data has to be complete as well as correct. Therefore, we need to ensure that there exist reliable public sources on the Web that contain the data points required to create the API.

    Once these factors are accounted for, the process of creating the APIs begins. One question that we are often asked is: how easy or difficult is it to create data APIs? That again depends on many factors. There are many dimensions to the data we are dealing with that help us decide the level of difficulty. Below, we briefly touch upon some of those:

    1. Structure: Textual data on the Web can be structured/semi-structured/unstructured. Extracting relevant data points from semi-structured and unstructured data without the existence of a data model can be extremely tricky. The process of recognizing the underlying pattern, automating the data extraction process, and monitoring accuracy of extracted data becomes difficult when dealing with unstructured data at scale.

    2. Temporality: Data can be static or temporal in nature. Aggregating static data sets is a one-time effort. Scenarios where data changes frequently or new data points are being generated pose challenges related to scalability and data consistency. For example, the India Local Mandi Prices API gets updated on a day-to-day basis with new data being added. When aggregating data that is temporal, monitoring changes to data sources and data accuracy becomes a challenge. One needs to have systems in place that ensure data is aggregated frequently and also monitored for accuracy.

    3. Completeness: At one end of the spectrum, we have existing data sets that are publicly downloadable. On the other end, we have data points spread across sources. In order to create data sets over these data points, they need to be aggregated and curated before they can be used. These data sources publish data in their own formats and update them at different intervals. As always, “the whole is larger than the sum of its parts”; these individual data points, when aggregated and presented together, have many more use cases than the individual data points themselves.

    4. Representations: Data on the Web exists in various formats including (if we are particularly unlucky!) non-standard/proprietary ones. Data exists in HTML, XML, XLS, PDFs, docs, and many more. Extracting data from these different formats and presenting them through standard representations comes with its own challenges.

    5. Complexity: Data sets wherein data points are independent of each other are fairly simple to reason about. On the other hand, consider network data sets such as social data, maps, and transportation networks. The complexity arises due to the relationships that can exist between data points within and across data sets. The extent of pre-processing required to analyse these relationships is huge, even at a small scale.

    6. (Pre/Post) Processing: There is a lot of pre-processing involved in making raw crawled data presentable and accessible through a data API. This involves cleaning, normalization, and representing data in standard forms. Once we have the data API, there are a number of ways this data can be processed to create new and interesting APIs.

    So that, at a high level, is the way we work at DataWeave. Our vision is to curate and provide access to all of the world’s public data. We are progressing towards this vision one API at a time.

    Originally published at blog.dataweave.in.

  • Difference Between Json, Ultrajson, & Simplejson | DataWeave

    Difference Between Json, Ultrajson, & Simplejson | DataWeave

    Without argument, one of the most commonly used data models is JSON. There are two popular packages for handling JSON in Python: the stock json package that comes with the default installation of Python, and simplejson, an optimized and actively maintained package. The goal of this blog post is to introduce ultrajson, or Ultra JSON, a JSON library written mostly in C and built to be extremely fast.

    We have benchmarked three popular operations: load, loads, and dumps. We have a dictionary with 3 keys: id, name, and address. We dump this dictionary using json.dumps() and store it in a file. Then we use json.loads() and json.load() separately to load the dictionaries back from the file. We performed this experiment on 10,000, 50,000, 100,000, 200,000, and 1,000,000 dictionaries, and observed how much time each library takes to perform each operation.
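    A minimal sketch of such a benchmark, assuming ultrajson is installed and imported as ujson, and using an illustrative three-key record, could look like this:

    ```python
    import time
    import json
    import simplejson
    import ujson  # ultrajson is imported as ujson

    libs = {'json': json, 'simplejson': simplejson, 'ujson': ujson}
    record = {'id': 1, 'name': 'DataWeave', 'address': 'Bangalore, India'}  # illustrative record

    for n in (10000, 50000, 100000, 200000, 1000000):
        docs = [dict(record, id=i) for i in range(n)]
        for name, lib in libs.items():
            start = time.time()
            dumped = [lib.dumps(d) for d in docs]    # dumps, one dictionary at a time
            loaded = [lib.loads(s) for s in dumped]  # loads, one dictionary at a time
            print('%-10s n=%-8d %.2f seconds' % (name, n, time.time() - start))
    ```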

    DUMPS OPERATION LINE BY LINE

    Here is the result we received using the json.dumps() operations. We have dumped the content dictionary by dictionary.

     

    We notice that json performs better than simplejson, but ultrajson wins the game with an almost 4x speedup over stock json.

    DUMPS OPERATION (ALL DICTIONARIES AT ONCE)

    In this experiment, we have stored all the dictionaries in a list and dumped the list using json.dumps().

    simplejson is almost as good as stock json, but again ultrajson outperforms them with more than a 60% speedup. Now let’s see how they perform for the load and loads operations.

    LOAD OPERATION ON A LIST OF DICTIONARIES

    Now we do the load operation on a list of dictionaries and compare the results.

    Surprisingly, simplejson beats the other two, with ultrajson coming in very close behind. Here, we observe that simplejson is almost 4 times faster than stock json, and ultrajson is nearly as fast.

    LOADS OPERATION ON DICTIONARIES

    In this experiment, we load dictionaries from the file one by one and pass them to the json.loads() function.

    Again ultrajson steals the show, being almost 6 times faster than stock json and 4 times faster than simplejson.

    That is all the benchmarks we have here. The verdict is pretty clear: use simplejson instead of stock json in any case, since simplejson is a well-maintained package. If you really want something extremely fast, then go for ultrajson. In that case, keep in mind that ultrajson only works with well-defined collections and will not work for collections that are not serializable. But if you are dealing with text, this should not be a problem.

     

    This post originally appeared here.

  • Mining Twitter for Reactions to Products & Brands | DataWeave

    Mining Twitter for Reactions to Products & Brands | DataWeave

    [This post was written by Dipanjan. Dipanjan works in the Engineering Team with Mandar, addressing some of the problems related to Data Semantics. He loves watching English Sitcoms in his spare time. This was originally posted on the PriceWeave blog.]

    This is the second post in our series of blog posts on social media analysis. We have already talked about Twitter mining in depth earlier, as well as how to analyze social trends in general and gather insights from YouTube. If you are more interested in developing a quick sentiment analysis app, you can check out our short tutorial on that as well.

    Our flagship product, PriceWeave, is all about delivering real time actionable insights at scale. PriceWeave helps Retailers and Brands take decisions on product pricing, promotions, and assortments on a day to day basis. One of the areas we focus on is “Social Intelligence”, where we measure our customers’ social presence in terms of their reach and engagement on different social channels. Social Intelligence also helps in discovering brands and products trending on social media.

    Today, I will be talking about how we can get data from Twitter in real-time and perform some interesting analytics on top of that to understand social reactions to trending brands and products.

    In our last post, we used Twitter’s Search API to get a selective set of tweets and performed some analytics on them. Today, we will be using Twitter’s Streaming API to access data feeds in real time. A couple of differences between the two APIs are as follows. The Search API is primarily a REST API which can be used to query for “historical data”. The Streaming API, on the other hand, gives us access to Twitter’s global stream of tweet data. Moreover, it lets you acquire much larger volumes of data with keyword filters in real time than normal search does.

    Installing Dependencies

    I will be using Python for my analysis as usual, so you can install it if you don’t have it already. You can use another language of your choice, but remember to use the relevant libraries for that language. To get started, install the following packages if you don’t have them already. We use simplejson for JSON data processing at DataWeave, but you are most welcome to use the stock json library.
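    The exact list depends on your environment, but something along these lines should cover what is used below:

    ```
    pip install twitter pandas matplotlib nltk simplejson
    ```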

    Acquiring Data

    We will use the Twitter Streaming API and the equivalent Python wrapper to get the required tweets. Since we will be looking to get a large number of tweets in real time, there is the question of where we should store the data and what data model to use. In general, when building a robust API or application over Twitter data, MongoDB, being a schemaless document-oriented database, is a good choice. It also supports expressive queries with indexing, filtering, and aggregations. However, since we are going to analyze a relatively small sample of data using pandas, we shall be storing the tweets in flat files.

    Note: Should you prefer to sink the data to MongoDB, the mongoexport command-line tool can be used to export it to a newline-delimited format that is exactly the same as what we will be writing to a file.

    The following code snippet shows you how to create a connection to Twitter’s Streaming API and filter for tweets containing a specific keyword. For simplicity, each tweet is saved in a newline-delimited file as a JSON document. Since we will be dealing with products and brands, I have queried two trending brands and two trending products: ‘Sony’ and ‘Microsoft’ for brands, and ‘iPhone 6’ and ‘Galaxy S5’ for products. You can wrap the snippet in a function for ease of use and call it for specific queries to do a comparative study.
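    A minimal sketch of such a function, assuming the python twitter package, placeholder OAuth credentials, and hypothetical one-file-per-query filenames:

    ```python
    import json
    import twitter

    # Placeholder credentials from your Twitter application dashboard
    CONSUMER_KEY = 'xxx'
    CONSUMER_SECRET = 'xxx'
    OAUTH_TOKEN = 'xxx'
    OAUTH_TOKEN_SECRET = 'xxx'

    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)
    twitter_stream = twitter.TwitterStream(auth=auth)

    def save_stream(query, filename, max_tweets=5000):
        """Filter the public stream for a keyword and append each tweet as one JSON line."""
        stream = twitter_stream.statuses.filter(track=query)
        with open(filename, 'a') as f:
            for i, tweet in enumerate(stream):
                f.write(json.dumps(tweet) + '\n')
                if i + 1 >= max_tweets:
                    break

    # Hypothetical usage, one file per query term:
    # save_stream('Sony', 'sony.json')
    # save_stream('iPhone 6', 'iphone6.json')
    ```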

    Let the data stream for a significant period of time so that you can capture a sizeable sample of tweets.

    Analyses and Visualizations

    Now that you have amassed a collection of tweets from the API in a newline delimited format, let’s start with the analyses. One of the easiest ways to load the data into pandas is to build a valid JSON array of the tweets. This can be accomplished using the following code segment.
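    One simple sketch of that step, assuming the hypothetical one-file-per-query layout from earlier (parsing each line and handing the list of dicts to pandas, which has the same effect as building one big JSON array):

    ```python
    import json
    import pandas as pd

    def load_tweets(filename):
        """Read a newline-delimited JSON file into a pandas DataFrame."""
        with open(filename) as f:
            records = [json.loads(line) for line in f if line.strip()]
        return pd.DataFrame(records)

    # Hypothetical filenames, one per query term
    queries = ['sony', 'microsoft', 'iphone6', 'galaxys5']
    frames = {q: load_tweets('%s.json' % q) for q in queries}
    ```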

    Note: With pandas, you will need to have an amount of working memory proportional to the amount of data that you’re analyzing.

    Once you run this, you should get a dictionary containing 4 data frames. The output I obtained is shown in the snapshot below.

    Note: Per the Streaming API guidelines, Twitter will only provide up to 1% of the total volume of real-time tweets; anything beyond that is filtered out, and a “limit notice” is sent in its place.

    The next snippet shows how to remove the “limit notice” column if you encounter it.
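    Limit notices arrive in the stream as small {"limit": ...} documents, so they show up as rows whose limit column is non-null. A hedged sketch of dropping them:

    ```python
    def strip_limit_notices(df):
        """Drop limit-notice rows and the 'limit' column, if present."""
        if 'limit' in df.columns:
            df = df[df['limit'].isnull()]      # keep only real tweets
            df = df.drop(columns=['limit'])
        return df

    frames = {q: strip_limit_notices(df) for q, df in frames.items()}
    ```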

    Time-based Analysis

    Each tweet we captured has a specific time when it was created. To analyze the time period over which we captured these tweets, let’s create a time-based index on the created_at field of each tweet, so that we can perform a time-based analysis and see at what times people post most frequently about our query terms.
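    A sketch of building that index with pandas:

    ```python
    import pandas as pd

    for q, df in frames.items():
        df['created_at'] = pd.to_datetime(df['created_at'])  # parse Twitter's timestamp strings
        frames[q] = df.set_index('created_at')
    ```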

    The output I obtained is shown in the snapshot below.

    I had started capturing the Twitter stream at around 7 pm on the 6th of December and stopped it at around 11:45 am on the 7th of December, so the results seem consistent with that. With a time-based index now in place, we can trivially do some useful things like calculate the boundaries, compute histograms, and so on. Operations such as grouping by a time unit are also easy to accomplish and seem a logical next step. The following code snippet illustrates how to group by the “hour” of our data frame, which is exposed as a datetime.datetime timestamp since we now have a time-based index in place. We also print an hourly distribution of tweets to see which brand/product was most talked about on Twitter during that time period.
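    A sketch of the hourly grouping over the DatetimeIndex we just created:

    ```python
    for q, df in frames.items():
        hourly = df.groupby(df.index.hour).size()  # tweets per hour of day
        print('Hourly tweet distribution for %s' % q)
        print(hourly)
    ```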

    The outputs I obtained are depicted in the snapshot below.

    The “Hour” field here follows a 24-hour format. What is interesting here is that people have been talking more about Sony than Microsoft among the brands. Among the products, iPhone 6 seems to be trending more than Samsung’s Galaxy S5. The trend also shows an interesting insight: people tend to talk more on Twitter in the morning and late evening.

    Time-based Visualizations

    It could be helpful to further subdivide the time ranges into smaller intervals so as to increase the resolution of the extremes. Therefore, let’s group into a custom interval by dividing the hour into 15-minute segments. The code is pretty much the same as before, except that you call a custom function to perform the grouping. This time, we will be visualizing the distributions using matplotlib.
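    A sketch of the 15-minute grouping and plot, with the custom function applied to the timestamps in the index:

    ```python
    import matplotlib.pyplot as plt

    def quarter_hour(ts):
        """Map a timestamp to the hour of day in 0.25 steps, e.g. 19.75 for 19:45."""
        return ts.hour + (ts.minute // 15) * 0.25

    for q in ('sony', 'microsoft'):   # repeat with 'iphone6' and 'galaxys5' for products
        frames[q].groupby(quarter_hour).size().plot(label=q, legend=True)
    plt.xlabel('Hour of day')
    plt.ylabel('Tweets per 15-minute interval')
    plt.show()
    ```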

    The two visualizations are depicted below. Of course, don’t forget to ignore the section of the plots from after 11:30 am to around 7 pm, because no tweets were collected by me during that time. This is indicated by a steep rise in the curve and is insignificant. The real regions of significance are from hour 7 to 11:30 and hour 19 to 22.

    Considering brands, the visualization for Microsoft vs. Sony is depicted below. Sony is the clear winner here.

    Considering products, the visualization for iPhone 6 vs. Galaxy S5 is depicted below. The clear winner here is definitely iPhone 6.

    Tweeting Frequency Analysis

    In addition to time-based analysis, we can do other types of analysis as well. The most popular analysis in this case would be a frequency-based analysis of the users authoring the tweets. The following code snippet computes the Twitter accounts that authored the most tweets and compares that against the total number of unique accounts that appeared for each of our query terms.
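    A sketch of that computation using collections.Counter on the screen_name embedded in each tweet’s user object:

    ```python
    from collections import Counter

    for q, df in frames.items():
        authors = df['user'].apply(lambda u: u['screen_name'])
        counts = Counter(authors)
        print('%s: %d tweets from %d unique accounts' % (q, len(authors), len(counts)))
        print(counts.most_common(10))   # the ten most prolific accounts
    ```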

    The results which I obtained are depicted below.

    What we do notice is that a lot of these tweets are also made by bots, advertisers and SEO technicians. Some examples are Galaxy_Sleeves and iphone6_sleeves which are obviously selling covers and cases for the devices.

    Tweeting Frequency Visualizations

    After the frequency analysis, we can plot these frequency values to get a better intuition about the underlying distribution, so let’s take a quick look at it using histograms. The following code snippet creates these visualizations for both brands and products using subplots.
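    A sketch of those histograms with matplotlib subplots, one panel per query term:

    ```python
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    for ax, (q, df) in zip(axes.ravel(), frames.items()):
        per_account = df['user'].apply(lambda u: u['screen_name']).value_counts()
        ax.hist(per_account.values, bins=20)
        ax.set_title(q)
        ax.set_xlabel('Tweets per account')
        ax.set_ylabel('Number of accounts')
    plt.tight_layout()
    plt.show()
    ```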

    The visualizations I obtained are depicted below.

    The distributions follow the “Pareto principle” as expected: a select few users make a large number of tweets, while the majority of users create only a small number of tweets. Besides that, we see from the tweet distributions that Sony and iPhone 6 are trending more than their counterparts.

    Locale Analysis

    Another important insight would be to see which locales your target audience tweets in, and how frequently. The following code snippet achieves this.
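    A sketch using the lang field carried by each tweet:

    ```python
    for q, df in frames.items():
        print('Top languages for %s' % q)
        print(df['lang'].value_counts().head(10))   # counts per language code
    ```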

    The outputs which I obtained are depicted in the following snapshot. Remember that Twitter follows the ISO 639–1 language code convention.

    The trend we see is that most of the tweets are in English, as expected. Surprisingly, most of the tweets regarding the iPhone 6 are in Japanese!

    Analysis of Trending Topics

    In this section, we will look at some of the topics associated with the terms we used for querying Twitter. For this, we will run our analysis on the tweets whose authors wrote in English. We will be using the nltk library here to take care of a couple of things, like removing stopwords, which carry little significance. I will be doing the analysis here for brands only, but you are most welcome to try it out with products too, because the following code snippet can be used to accomplish both computations.
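    A sketch of that computation, tokenizing with a simple split and filtering out nltk’s English stopwords (the extra throwaway tokens in the set are my own assumption):

    ```python
    import nltk
    from collections import Counter
    from nltk.corpus import stopwords

    # nltk.download('stopwords')  # uncomment on the first run
    stop = set(stopwords.words('english')) | {'rt', 'amp', 'http', 'https', 'co'}

    def top_terms(df, n=20):
        """Return the n most common terms in the English tweets of a DataFrame."""
        texts = df[df['lang'] == 'en']['text']
        tokens = [w for text in texts
                    for w in text.lower().split()
                    if w.isalpha() and w not in stop]
        return Counter(tokens).most_common(n)

    for q in ('sony', 'microsoft'):
        print(q, top_terms(frames[q]))
    ```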

    What the above code does is take each tweet, tokenize it, compute term frequencies, and output the 20 most common terms for each brand. Of course, an n-gram analysis can give deeper insight into trending topics, but something similar can also be accomplished with nltk’s collocations function, which takes in the tokens and outputs the contexts in which they were mentioned. The outputs I obtained are depicted in the snapshot below.

    Some interesting insights we see from the above outputs are as follows.

    • Sony was hacked recently, and it was rumored that North Korea was responsible, though they have denied it. We can see that this is trending on Twitter in the context of Sony. You can read about it here.
    • Sony has recently introduced Project Sony Skylight which lets you customize your PS4.
    • There are rumors of Lumia 1030, Microsoft’s first flagship phone.
    • People are also talking a lot about Windows 10, the next OS which is going to be released by Microsoft pretty soon.
    • Interestingly, “ebay price” comes up for both brands; this might be an indication that eBay is offering discounts on products from both these brands.

    To get a detailed view of the tweets matching some of these trending terms, we can use nltk’s concordance function as follows.
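    A sketch using nltk.Text over the same tokens, with ‘hacked’ as a hypothetical term of interest:

    ```python
    import nltk

    sony_en = frames['sony'][frames['sony']['lang'] == 'en']['text']
    tokens = [w for text in sony_en for w in text.lower().split()]
    nltk.Text(tokens).concordance('hacked')   # prints each tweet fragment containing the term
    ```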

    The outputs I obtained are as follows. We can clearly see the tweets which contain the token we searched for. In case you are unable to view the text clearly, click on the image to zoom.

    Thus, you can see that the Twitter Streaming API is a really good source to track social reaction to any particular entity whether it is a brand or a product. On top of that, if you are armed with an arsenal of Python’s powerful analysis tools and libraries, you can get the best insights from the unending stream of tweets.

    That’s all for now folks! Before I sign off, I would like to thank Matthew A. Russell and his excellent book Mining the Social Web once again, without which this post would not have been possible. Cover image credit goes to TechCrunch.

  • Mining Twitter to Analyze Product Trends | DataWeave

    Mining Twitter to Analyze Product Trends | DataWeave

    Due to the massive growth of social media in the last decade, it has become a rage among data enthusiasts to tap into the vast pool of social data and gather interesting insights such as trending items, the reception of newly released products, and popularity measures, to name a few.

    We are continually evolving PriceWeave, which has the most extensive set of offerings when it comes to providing actionable insights to retail stores and brands. As part of the product development, we look at social data from a variety of channels to mine things like: trending products/brands; social engagement of stores/brands; what content “works” and what does not on social media, and so forth.

    We do a number of experiments with mining Twitter data, and this series of blog posts is one of the outputs from those efforts.

    In some of our recent blog posts, we have seen how to look at current trends and gather insights from YouTube, the popular video-sharing website. We have also talked about how to create a quick bare-bones web application to perform sentiment analysis of tweets from Twitter. Today I will be talking about mining data from Twitter and doing much more with it than just sentiment analysis. We will analyze Twitter data in depth and then try to get some interesting insights from it.

    To get data from Twitter, we first need to create a new Twitter application to get OAuth credentials and access to their APIs. To do this, head over to the Twitter Application Management page and sign in with your Twitter credentials. Once you are logged in, click on the Create New App button, as you can see in the snapshot below. Once you create the application, you will be able to view it in your dashboard, just like the application I created, named DataScienceApp1_DS, shows up in my dashboard depicted below.

    On clicking the application, it will take you to your application management dashboard. Here, you will find the necessary keys you need in the Keys and Access Tokens section. The main tokens you need are highlighted in the snapshot below.

    I will be doing most of my analysis using the Python programming language. To be more specific, I will be using the IPython shell, but you are most welcome to use the language of your choice, provided you get the relevant API wrappers and necessary libraries.

    Installing necessary packages

    After obtaining the necessary tokens, we will install some necessary libraries and packages, namely twitter, prettytable, and matplotlib. Fire up your terminal or command prompt and use the following commands to install the libraries if you don’t have them already.
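    Depending on your environment, something like this should do:

    ```
    pip install twitter prettytable matplotlib
    ```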

    Creating a Twitter API Connection

    Once the packages are installed, you can start writing some code. For this, open up the IDE or text editor of your choice and use the following code segment to create an authenticated connection to Twitter’s API. The snippet works by using your OAuth credentials to create an object called auth that represents your OAuth authorization. This is then passed to a class called Twitter, belonging to the twitter library, to create a resource object named twitter_api that is capable of issuing queries to Twitter’s API.
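    A minimal sketch, with placeholder strings standing in for the keys from the application dashboard:

    ```python
    import twitter

    CONSUMER_KEY = 'xxx'          # placeholders: copy these from the
    CONSUMER_SECRET = 'xxx'       # "Keys and Access Tokens" section
    OAUTH_TOKEN = 'xxx'
    OAUTH_TOKEN_SECRET = 'xxx'

    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)
    twitter_api = twitter.Twitter(auth=auth)
    print(twitter_api)
    ```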

    If you print twitter_api and all your tokens are correct, you should get something similar to the snapshot below. This indicates that we’ve successfully used our OAuth credentials to gain authorization to query Twitter’s API.

    Exploring Trending Topics

    Now that we have a working Twitter resource object, we can start issuing requests to Twitter. Here, we will be looking at the topics which are currently trending worldwide using some specific API calls. The API can also be parameterized to constrain the topics to more specific locales and regions. Each query uses a unique identifier which follows the Yahoo! GeoPlanet’s Where On Earth (WOE) ID system, which is an API itself that aims to provide a way to map a unique identifier to any named place on Earth. The following code segment retrieves trending topics in the world, the US and in India.
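    A sketch of those calls; the world and US WOE IDs are the commonly used values, and the India WOE ID shown here is an assumption worth verifying against GeoPlanet:

    ```python
    WORLD_WOE_ID = 1
    US_WOE_ID = 23424977
    INDIA_WOE_ID = 23424848   # assumed; double-check via the GeoPlanet lookup

    world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
    us_trends = twitter_api.trends.place(_id=US_WOE_ID)
    india_trends = twitter_api.trends.place(_id=INDIA_WOE_ID)
    ```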

    Once you print the responses, you will see a bunch of output that looks like JSON data. To view the output in a pretty format, use the following commands and you will get the output as pretty-printed JSON, shown in the snapshot below.
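    For example:

    ```python
    import json
    print(json.dumps(world_trends, indent=1))   # pretty-print the raw response
    ```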

    To view all the trending topics in a convenient way, we will be using list comprehensions to slice the data we need and print it using prettytable as shown below.
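    A sketch of that tabulation:

    ```python
    from prettytable import PrettyTable

    for label, trends in (('World', world_trends),
                          ('US', us_trends),
                          ('India', india_trends)):
        pt = PrettyTable(field_names=[label])
        for trend in trends[0]['trends']:      # each response wraps a list of trend dicts
            pt.add_row([trend['name']])
        pt.align[label] = 'l'
        print(pt)
    ```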

    On printing the result, you will get a neatly tabulated list of current trends which keep changing with time.

    Now, we will try to analyze and see if some of these trends are common. For that we use Python’s set data structure and compute intersections to get common trends as shown in the snapshot below.
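    A sketch of those intersections:

    ```python
    world_names = {t['name'] for t in world_trends[0]['trends']}
    us_names = {t['name'] for t in us_trends[0]['trends']}
    india_names = {t['name'] for t in india_trends[0]['trends']}

    print(world_names & us_names)   # trends common to the world and the US
    print(us_names & india_names)   # trends common to the US and India
    ```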

    Interestingly, some of the trending topics at this moment in the US are common with some of the trending topics in the world. The same holds good for US and India.

    Mining for Tweets

    In this section, we will look at ways to mine Twitter for tweets based on specific queries and to extract useful information from the query results. For this we will be using the Twitter API’s GET search/tweets resource. Since the Google Nexus 6 phone was launched recently, I will be using that as my query string. You can use the following code segment to make a robust API request to Twitter to get a sizeable number of tweets.
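    A sketch of such a request loop, paging through results via the search_metadata node explained below (the query and batch counts are illustrative):

    ```python
    q = 'Nexus 6'
    count = 100

    search_results = twitter_api.search.tweets(q=q, count=count)
    statuses = search_results['statuses']

    for _ in range(5):  # fetch five more batches
        try:
            next_results = search_results['search_metadata']['next_results']
        except KeyError:
            break  # no more results
        # next_results looks like "?max_id=...&q=...&count=100"; turn it into kwargs
        kwargs = dict(kv.split('=') for kv in next_results[1:].split('&'))
        search_results = twitter_api.search.tweets(**kwargs)
        statuses += search_results['statuses']

    print(len(statuses), 'tweets collected')
    ```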

    The code snippet above makes repeated requests to the Twitter Search API. Search results contain a special search_metadata node that embeds a next_results field with a query string that provides the basis for making a subsequent query. If we weren’t using a library like twitter to make the HTTP requests for us, this preconstructed query string would just be appended to the Search API URL, and we’d update it with additional parameters for handling OAuth. However, since we are not making our HTTP requests directly, we must parse the query string into its constituent key/value pairs and provide them as keyword arguments to the search/tweets API endpoint. I have provided a snapshot below showing how this dictionary of key/value pairs is constructed and passed as kwargs to the Twitter.search.tweets(..) method.

    Analyzing the structure of a Tweet

    In this section we will see what are the main features of a tweet and what insights can be obtained from them. For this we will be taking a sample tweet from our list of tweets and examining it closely. To get a detailed overview of tweets, you can refer to this excellent resource created by Twitter. I have extracted a sample tweet into the variable sample_tweet for ease of use. sample_tweet.keys() returns the top-level fields for the tweet.

    Typically, a tweet has some of the following data points which are of great interest.

    • The identifier of the tweet can be accessed through sample_tweet[‘id’]
    • The human-readable text of a tweet is available through sample_tweet[‘text’]
    • The entities in the text of a tweet are conveniently processed and available through sample_tweet[‘entities’]
    • The “interestingness” of a tweet is available through sample_tweet[‘favorite_count’] and sample_tweet[‘retweet_count’], which return the number of times it’s been bookmarked or retweeted, respectively
    • An important thing to note is that the retweet_count reflects the total number of times the original tweet has been retweeted and should reflect the same value in both the original tweet and all subsequent retweets. In other words, retweets aren’t retweeted
    • The user details can be accessed through sample_tweet[‘user’] which contains details like screen_name, friends_count, followers_count, name, location and so on

    Some of the above data points are depicted in the snapshot below for the sample_tweet. Note that the names have been changed to protect the identity of the entity that created the status.

    Before we move on to the next section, my advice is that you should play around with the sample tweet and consult the documentation to clarify all your doubts. A good working knowledge of a tweet’s anatomy is critical to effectively mining Twitter data.

    Extracting Tweet Entities

    In this section, we will be filtering out the text statuses of tweets and different entities of tweets, like hashtags. For this, we will be using list comprehensions, which are faster than normal looping constructs and yield substantial performance gains. Use the following code snippet to extract the texts, screen names, and hashtags from the tweets. I have also displayed the first five samples from each list just for clarity.
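    A sketch of those list comprehensions over the collected statuses, with prettytable used to show the first five samples from each list:

    ```python
    from prettytable import PrettyTable

    status_texts = [s['text'] for s in statuses]

    screen_names = [mention['screen_name']
                    for s in statuses
                    for mention in s['entities']['user_mentions']]

    hashtags = [tag['text']
                for s in statuses
                for tag in s['entities']['hashtags']]

    # Collapse all status texts into a flat list of words
    words = [w for text in status_texts for w in text.split()]

    # Show the first five samples from each list
    for label, data in (('Text', status_texts),
                        ('Screen Name', screen_names),
                        ('Hashtag', hashtags)):
        pt = PrettyTable(field_names=[label])
        for item in data[:5]:
            pt.add_row([item])
        pt.align[label] = 'l'
        pt.max_width[label] = 60
        print(pt)
    ```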

    Once you print the tables, you should get a view of the sample data that looks something like the table below, but with different content of course!

    Frequency Analysis of Tweet and Tweet Entities

    Once we have all the required data in relevant data structures, we will do some analysis on it. The most common analysis would be a frequency analysis, where we find the most common terms occurring in the different entities of the tweets. For this we will be making use of the collections module. The following code snippet ranks the top ten most frequently occurring tweet entities and prints them as a table.
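    A sketch using collections.Counter together with prettytable:

    ```python
    from collections import Counter
    from prettytable import PrettyTable

    for label, data in (('Word', words),
                        ('Screen Name', screen_names),
                        ('Hashtag', hashtags)):
        pt = PrettyTable(field_names=[label, 'Count'])
        for item, count in Counter(data).most_common(10):
            pt.add_row([item, count])
        pt.align[label], pt.align['Count'] = 'l', 'r'
        print(pt)
    ```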

    The output I obtained is shown in the snapshot below. As you can see, there is a lot of noise in the tweets because of which several meaningless terms and symbols have crept into the top ten list. For this, we can use some pre-processing and data cleaning techniques.

    Analyzing the Lexical Diversity of Tweets

    A slightly more advanced measurement that involves calculating simple frequencies and can be applied to unstructured text is a metric called lexical diversity. Mathematically, lexical diversity can be defined as an expression of the number of unique tokens in the text divided by the total number of tokens in the text. Let us take an example to understand this better. Suppose you are listening to someone who repeatedly says “and stuff” to broadly generalize information as opposed to providing specific examples to reinforce points with more detail or clarity. Now, contrast that speaker to someone else who seldom uses the word “stuff” to generalize and instead reinforces points with concrete examples. The speaker who repeatedly says “and stuff” would have a lower lexical diversity than the speaker who uses a more diverse vocabulary.

    The following code snippet computes the lexical diversity for status texts, screen names, and hashtags in our data set. We also measure the average number of words per tweet.
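    A sketch of those calculations:

    ```python
    def lexical_diversity(tokens):
        """Unique tokens divided by total tokens."""
        return 1.0 * len(set(tokens)) / len(tokens)

    def average_words(texts):
        """Average number of words per status text."""
        return 1.0 * sum(len(t.split()) for t in texts) / len(texts)

    print(lexical_diversity(words))
    print(lexical_diversity(screen_names))
    print(lexical_diversity(hashtags))
    print(average_words(status_texts))
    ```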

    The output which I obtained is depicted in the snapshot below.

    Now, I am sure you must be thinking, what on earth do the above numbers indicate? We can analyze the above results as follows.

    • The lexical diversity of the words in the text of the tweets is around 0.097. This can be interpreted as each status update carrying around 9.7% unique information. The reason for this is that most of the tweets contain terms like Android, Nexus 6, and Google
    • The lexical diversity of the screen names, however, is even higher, with a value of 0.59 or 59%, which means that about 29 out of 49 screen names mentioned are unique. This is obviously higher because in the data set, different people will be posting about Nexus 6
    • The lexical diversity of the hashtags is extremely low, at a value of around 0.029 or 2.9%, implying that very few values other than the #Nexus6 hashtag appear multiple times in the results. This is relevant because tweets about Nexus 6 should contain this hashtag
    • The average number of words per tweet is around 18 words

    This gives us some interesting insights, such as the fact that people mostly talk about the Nexus 6 itself when we query for that keyword. Also, if we look at the top hashtags, we see that Nexus 5 co-occurs a lot with Nexus 6. This might be an indication that people are comparing these phones when they tweet.

    Examining Patterns in Retweets

    In this section, we will analyze our data to determine whether there were any particular tweets that were highly retweeted. The approach we’ll take to find the most popular retweets is to simply iterate over each status update and store the retweet count, the originator of the retweet, and the status text of the retweet, if the status update is a retweet. We will use a list comprehension and sort by the retweet count to display the top few results in the following code snippet.
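    A sketch of that list comprehension and the sorted table:

    ```python
    from prettytable import PrettyTable

    retweets = [(s['retweet_count'],
                 s['retweeted_status']['user']['screen_name'],
                 s['text'])
                for s in statuses
                if 'retweeted_status' in s]   # only status updates that are retweets

    pt = PrettyTable(field_names=['Count', 'Screen Name', 'Text'])
    for row in sorted(retweets, reverse=True)[:5]:
        pt.add_row(row)
    pt.max_width['Text'] = 50
    pt.align = 'l'
    print(pt)
    ```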

    The output I obtained is depicted in the following snapshot.

    From the results, we see that the top retweet is from the official googlenexus channel on Twitter, and the tweet speaks about the phone being used non-stop for 6 hours on only a 15-minute charge. Thus, you can see that this has definitely been received positively by users, based on its retweet count. You can detect similar interesting patterns in retweets based on the topics of your choice.

    Visualizing Frequency Data

    In this section, we will be creating some interesting visualizations from our data set. For plotting we will be using matplotlib, a popular Python plotting library that works well with IPython. If you don’t have matplotlib loaded by default, use the command import matplotlib.pyplot as plt in your code.

    Visualizing word frequencies

    In our first plot, we will be displaying the results from the words variable, which contains the different words from the tweet status texts. Using Counter from the collections package, we generate a sorted list of tuples, where each tuple is a (word, frequency) pair. The x-axis value will correspond to the index of the tuple, and the y-axis will correspond to the frequency of the word in that tuple. We transform both axes into a logarithmic scale because of the vast number of data points.
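    A sketch of that plot:

    ```python
    from collections import Counter
    import matplotlib.pyplot as plt

    word_counts = sorted(Counter(words).values(), reverse=True)
    plt.loglog(word_counts)      # log scale on both axes
    plt.xlabel('Word Rank')
    plt.ylabel('Frequency')
    plt.show()
    ```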

    Visualizing words, screen names, and hashtags

    A line chart of frequency values is decent enough. But what if we want to find out the number of words having a frequency between 1–5, 5–10, 10–15, and so on? For this purpose we will be using a histogram to depict the frequencies. The following code snippet achieves this.
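    A sketch of those histograms:

    ```python
    from collections import Counter
    import matplotlib.pyplot as plt

    for label, data in (('Words', words),
                        ('Screen Names', screen_names),
                        ('Hashtags', hashtags)):
        counts = list(Counter(data).values())
        plt.figure()
        plt.hist(counts)
        plt.title(label)
        plt.xlabel('Bins (number of times an item appeared)')
        plt.ylabel('Number of items in bin')
    plt.show()
    ```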

    What this essentially does is take all the frequencies, group them into bins or ranges, and plot the number of entities that fall into each bin or range. The plots I obtained are shown below.

    From the above plots, we can observe that all three distributions follow the “Pareto principle”: the large majority of words, screen names, and hashtags occur only rarely, while a small minority account for most of the occurrences. In short, if we consider hashtags, a lot of hashtags occur maybe only once or twice in the whole data set, while very few hashtags like #Nexus6 occur in almost all the tweets in the data set, leading to their high frequency values.

    Visualizing retweets

    In this visualization, we will use a histogram to look at the retweet counts themselves, using the following code snippet.
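    A sketch of that histogram, reusing the retweets list from earlier:

    ```python
    import matplotlib.pyplot as plt

    retweet_counts = [count for count, _, _ in retweets]
    plt.hist(retweet_counts)
    plt.title('Retweets')
    plt.xlabel('Bins (number of times retweeted)')
    plt.ylabel('Number of tweets in bin')
    plt.show()
    ```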

    The plot which I obtained is shown below.

    Looking at the frequency counts, it is clear that very few retweets have a large count.

    I hope you have seen by now how powerful the Twitter APIs are, and how, using simple Python libraries and modules, it is really easy to generate very powerful and interesting insights. That’s all for now folks! I will be talking more about Twitter mining in another post sometime in the future. A ton of thanks goes out to Matthew A. Russell and his excellent book Mining the Social Web, without which this post would never have been possible. Cover image credit goes to Social Media.