Wisdom of Unstructured Data: Building Airbnb’s Listing Knowledge from Big Text Data | by Hongwei Harvey Li | The Airbnb Tech Blog

How Airbnb leverages ML/NLP to extract useful information about listings from unstructured text data to power personalized experiences for guests.

By: Hongwei Li and Peng Wang

At Airbnb, it’s important for us to gather structured data about listings and better understand the data, so we can help Hosts provide great experiences for guests. For example, guests who work remotely need to know if a listing has a suitable workspace and reliable internet, while guests with children might need items like highchairs and cribs. However, not all listings clearly display these attributes, which creates a mismatch between what Hosts’ listings have and what guests are looking for.

This is just one of many examples of how we can use the unstructured data generated on our platform, including text data that has undergone anonymization steps from various text-based guest interactions with the platform, to extract useful structured data. Instead of relying on Hosts to manually input all the potential listing attributes, which would be tedious given the vast number of attributes guests care and inquire about, we developed a machine learning system called Listing Attribute Extraction Platform (LAEP) for extracting the structured data at scale. Note that the project was originally named LATEX (Listing ATtribute EXtraction) and is cited under that name in our previous tech blog. We have since renamed the project to LAEP.

LAEP automatically extracts structured information, such as listing attributes, directly from the unstructured text data we mentioned above. The attributes collected by LAEP are then integrated into various applications, building Airbnb’s Listing Knowledge. It powers downstream tools like the Attribute Prioritization System (APS) and the listing attribute collection system (Eve).

LAEP doesn’t just extract listing attributes; it can also detect different types of entities, such as activities, hospitalities, and points of interest (POI) like famous landmarks.
This opens up possibilities for supporting a wide range of product applications. For example, hospitality data can help guests get personalized services during the stay, while activity data can help identify and create new categories that guests love.

Figure 1. An illustration of the flow from LAEP to downstream applications such as the listing attribute collection system (Eve) and the Attribute Prioritization System (APS), which then feed into the Structure Data Catalog.

Prior to LAEP, Airbnb had multiple ways to collect structured information for listings, including the Listing Editors page for Hosts, the Supplementary Review Flow (SRF) for guests, and partnering with third-party vendors. However, these approaches faced several challenges and limitations. For instance, Airbnb minimized the impressions of SRF questions in the standard review flow to improve the guest review experience, resulting in reduced data intake from the guest side. Consequently, there has been a growing need to extract listing information from unstructured text data, and LAEP was developed to address these issues by automating this data collection process.

The LAEP technology gathers and analyzes anonymous and unstructured text data, enabling many potential applications that can enhance the Airbnb experience for both Hosts and guests.

There are three main components in LAEP:

Named Entity Recognition (NER): This component identifies and classifies specific phrases or entities in free text into predefined categories like amenities, places of interest, and facilities. For example, the phrase “swimming pool” appearing in any of the text sources would be detected as an entity of type “Amenity”.

Entity Mapping (EM): Once an entity is detected, EM maps it to standard listing attributes stored in Airbnb’s attribute database (Taxonomy).
This allows LAEP to create a comprehensive catalog of Airbnb listings by associating detected entities with their corresponding attributes.

Entity Scoring (ES): ES determines the presence of a detected phrase within a listing. It infers whether the attribute mentioned actually exists in the associated listing and provides a confidence level.

The components within LAEP are illustrated below:

Figure 2. The scope of LAEP includes three main components: Named Entity Recognition, Entity Mapping and Entity Scoring.

There are many off-the-shelf pretrained NER models that can extract general entity categories, but none of them fully supports Airbnb’s use cases. Therefore, we built our own NER models to detect and extract predefined entities important to Airbnb’s business from free text.

The NER model defines five types of entities (Amenity, Facility, Hospitality, Location features, and Structural details) that are important to Airbnb. We sampled and labeled 30K example texts from six channels, then trained the NER model. For current product use cases, we apply a language detection module and keep only English text. In the future we may build a multilingual Transformer-based NER model to handle non-English content. Text is then split into tokens. The NER model locates entity spans and classifies entity labels using a convolutional neural network (CNN) framework. The output is a list of detected named entities, in the format of tuples <entity label, start index, end index>. Combining all components together, the NER pipeline is shown in Figure 3.

Figure 3. The overview of the NER pipeline and its functional components.

Figure 4 shows an example output from the pipeline with detected entities highlighted.

Figure 4. Example output from the NER pipeline. The detected entities are highlighted and each entity category is marked with a different color.

The labeled dataset was randomly split into training and testing datasets with a 9:1 ratio.
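
As a minimal sketch of the split and of scoring over entity tuples, the Python below shows a seeded random 9:1 split and a strict-match scorer, in which a prediction counts only when its label and both boundaries equal those of a labeled entity. The function names, the example sentence, the "Amenity" labels, and the use of end-exclusive character offsets are all assumptions made for illustration; the post does not say how the indices are defined or how the split was implemented.

```python
import random
from typing import List, Set, Tuple

# (entity label, start index, end index), mirroring the tuple format above.
# Assumption: indices are character offsets with an exclusive end.
Entity = Tuple[str, int, int]


def split_9_to_1(examples: List, seed: int = 0):
    """Shuffle a copy of the examples and cut it 90% / 10%."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 9) // 10
    return shuffled[:cut], shuffled[cut:]


def strict_match_scores(gold: Set[Entity], predicted: Set[Entity]):
    """Precision, recall and F1 where a prediction counts only if its
    label and both boundaries match a gold entity exactly."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical sentence: "There is a swimming pool and a lock box."
gold = {("Amenity", 11, 24), ("Amenity", 31, 39)}
# The second prediction has the right label but stops after "lock",
# so under strict matching it is counted as wrong.
predicted = {("Amenity", 11, 24), ("Amenity", 31, 35)}
precision, recall, f1 = strict_match_scores(gold, predicted)

train_set, test_set = split_9_to_1(list(range(100)))
```

In this toy case one of the two predictions matches exactly, so precision, recall and F1 all come out to 0.5. A looser criterion that gave partial credit for overlapping spans would score the same predictions higher, which is why the matching rule matters when reading NER metrics.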
After training completes, we evaluate the model performance on the testing dataset across all text channels. The evaluation criterion is Strict Match, which requires correctly identifying both the boundary and the category of an entity. The overall model performance and each category’s performance are shown in Figure 5.

Figure 5. Example performance metrics (Precision, Recall, F1 scores) for the NER model.

There are many different ways for people to talk about the same thing. For instance, we found over twelve variations for the attribute “lockbox,” such as lock box, lock-box, box for the key, and keybox. Typos like “ket box” are also common due to input from error-prone mobile devices. Therefore, we need to map different variations of named entities to the standard entity name as defined by the standard taxonomy for downstream applications.

With hundreds of listing attributes but millions of detected phrases in a year, many phrases map to the same attribute (like “lockbox”) while others have no mapping. To address this, we introduce confidence levels for mappings, allowing us to establish rules for cases where mapping cannot be done. A confidence value between 0 and 1 is assigned, and if no mappings exceed the confidence threshold, the phrase is marked as “No Mapping.”

Labeling these mappings becomes challenging when dealing with numerous unique phrases and potential attributes. Typically, labeling involves comparing the semantic similarity between the phrase and each of the 800+ attribute names. To avoid this significant labeling effort, we started with unsupervised learning methods rather than supervised ones.

Figure 6.
Entity Mapping: map detected NER phrases from free text to predefined listing attributes.

In LAEP, the entity mapping approach involves the following steps:

Preprocessing: Both the listing attributes and detected phrases undergo preprocessing techniques such as lowercasing and lemmatization to eliminate unnecessary word variations.

Mapping to Word Embeddings: All standard listing attributes are mapped to the word-embedding space using a word2vec model fine-tuned with Airbnb’s text data.

Finding Closest Attribute: For a preprocessed detected phrase, the closest listing attribute is determined based on cosine similarity in the word-embedding space. The similarity score serves as the confidence score for the mapping.

As shown in the figure above, the phrase “Lock-box” is mapped into the embedding space of listing attributes and compared with each attribute. The closest match is the attribute “lockbox,” which is identified as the top mapping.

After mapping a detected phrase to a standard listing attribute, it’s important to infer metadata about the attribute, such as its existence, usability, and local sentiment. Among these, attribute presence is crucial for the guest experience, especially for amenities like “crib” or “highchair” for guests with infants.

The presence model in LAEP determines if the mapped attribute exists in the listing by performing local text classification. It provides a discrete output (YES, Unknown, NO) indicating attribute presence, accompanied by a confidence score reflecting the level of confidence in the inference.

The label classes are {YES, Unknown, NO}, where YES means the attribute is present, NO means it’s not present, and Unknown accounts for cases where presence is hard to determine from the text alone (e.g., amenity not present).

Figure 7. Illustration of entity scoring for the meta info about certain entities of interest.

To build this text classification model, the ES component employs a fine-tuned BERT model.
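
Before going further into the presence model, the entity mapping steps listed earlier (preprocess, embed, pick the closest attribute by cosine similarity, and fall back to “No Mapping” under a confidence threshold) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Airbnb's implementation: the four-entry taxonomy, the 0.5 threshold, and all function names are hypothetical, lemmatization is omitted, and character-trigram counts stand in for the fine-tuned word2vec embeddings so that the example runs without a trained model.

```python
import math
import re
from collections import Counter

# Hypothetical miniature taxonomy; the real one has hundreds of attributes.
ATTRIBUTES = ["lockbox", "swimming pool", "crib", "high chair"]
NO_MAPPING = "No Mapping"


def preprocess(text):
    """Lowercase and collapse punctuation/whitespace (no lemmatization here)."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def embed(text):
    """Stand-in embedding: character-trigram counts, NOT word2vec."""
    compact = preprocess(text).replace(" ", "")
    padded = f"#{compact}#"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_entity(phrase, attributes=ATTRIBUTES, threshold=0.5):
    """Return (attribute or NO_MAPPING, confidence) for a detected phrase.

    The best cosine similarity doubles as the confidence score; the
    threshold value is an assumption chosen for this example.
    """
    phrase_vec = embed(phrase)
    best_attr, best_score = max(
        ((attr, cosine(phrase_vec, embed(attr))) for attr in attributes),
        key=lambda pair: pair[1],
    )
    if best_score < threshold:
        return NO_MAPPING, best_score
    return best_attr, best_score


lockbox_result = map_entity("Lock-box")
unrelated_result = map_entity("mountain view")
```

With this stand-in, surface variants such as “Lock-box” resolve to lockbox with a score of about 1.0, while an unrelated phrase such as “mountain view” falls under the threshold and is returned as No Mapping. Variants that differ in wording rather than spelling, such as “box for the key”, would not clear the threshold here; catching those is what learned embeddings are for.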
The model analyzes source data, including detected phrases and their local context, to infer attribute existence. The output can then be used in the APS and Eve systems to provide recommendations to Hosts, merchandize existing home attributes, or clarify popular listing facilities.

Figure 8. Architecture of the Presence Score Model. (Adapted from Zahera and Sherif et al.)

The model architecture (Figure 8) utilizes a pre-trained BERT model with text data from six different sources. The input text is truncated to a maximum length of 512 tokens. Empirical studies suggest that using 65 words around the detected phrase (32 before and 32 after) achieves the best result. The embeddings from the [CLS] token are passed through a fully connected layer, dropout layer, and ReLU linear projection layer to generate a probabilistic vector over the label classes.

In this post, we introduced LAEP, an end-to-end structured information extraction system within Airbnb. It detects phrases of interest from various text data sources, maps them into standard listing attribute taxonomies, and then infers the meta information of the attributes from the contextual information in the texts, with privacy-by-design controls intended to avoid processing personal information. LAEP is applied in downstream applications like APS, and can be leveraged to help our teams find new categories of listings and discover new listing attributes that matter to guests. It helps us understand Airbnb’s listings better at scale and can power future applications to continue improving the experience of both our Hosts and guests.

If this type of work interests you, check out some of our related positions at Careers at Airbnb!

We would like to thank all the people who supported this project: Qianru Ma, Joy Jing, Xiao Li, Brennan Polley, Paolo Massimi, Dean Chen, Guillaume Guy, Lianghao Li, Mia Zhao, Joy Zhang, Usman Abbasi, Pavan Tapadia, Jing Xia, Maggie Jarley and more.
Special thanks to Ben Mendeler, Shaowei Su, Alfredo Luque and Tianxiang Chen from the ML-infra team for their generous support and help.

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.