Advanced Parsing¶
Parsing is the core of the AirDoo module. It transforms raw Airbnb emails into structured data usable by Odoo.
Parser Architecture¶
graph TD
A[Raw Email] --> B[HTML/Text Extraction]
B --> C[Cleaning & Normalization]
C --> D[Format Identification]
D --> E{Specific Parsing}
E -->|Format A| F[French Parser]
E -->|Format B| G[English Parser]
E -->|Format C| H[Portuguese Parser]
F --> I[Data Validation]
G --> I
H --> I
I --> J[Enrichment]
J --> K[Structured Data]
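The stages in the diagram can be sketched as a chain of small functions. This is a minimal illustration, not AirDoo's actual code: the function names (`extract_text`, `normalize`, `identify_format`, `run_pipeline`) and the marker phrases used for format identification are all assumptions.

```python
import html
import re

def extract_text(raw_email: str) -> str:
    """HTML/Text extraction: drop tags and decode entities (naive, for illustration)."""
    without_tags = re.sub(r'<[^>]+>', ' ', raw_email)
    return html.unescape(without_tags)

def normalize(text: str) -> str:
    """Cleaning & normalization: collapse runs of whitespace."""
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical marker phrases; real Airbnb emails may use different wording.
FORMAT_MARKERS = {
    'fr': 'réservation confirmée',
    'en': 'reservation confirmed',
    'pt': 'reserva confirmada',
}

def identify_format(text: str):
    """Format identification: return a language key, or None if nothing matches."""
    lowered = text.lower()
    for language, marker in FORMAT_MARKERS.items():
        if marker in lowered:
            return language
    return None

def run_pipeline(raw_email: str, parsers: dict):
    """Run extraction, cleaning, identification, parsing and enrichment in order."""
    text = normalize(extract_text(raw_email))
    language = identify_format(text)
    if language is None or language not in parsers:
        return None  # unrecognized format
    data = parsers[language](text)
    data['source_language'] = language  # enrichment step
    return data
```

Passing the language-specific parsers in as a dictionary keeps each parser independent of the pipeline, so a new format only needs a new entry.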
Supported Formats¶
Languages¶
- French: Standard France/Belgium/Switzerland format
- English: International format
- Portuguese: Brazil/Portugal format
- Spanish: Spain/Latin America format
- German: Germany/Austria/Switzerland format
- Italian: Italy format
Email Types¶
- Booking confirmation (primary)
- Booking modification
- Booking cancellation
- Host message (optional)
Structure of Extracted Data¶
Required Data¶
{
    "confirmation_code": "HM2YEYHZXC",  # Unique Airbnb code
    "checkin_date": "2026-02-01",
    "checkout_date": "2026-02-07",
    "accommodation_name": "Chalet du Frenalay",
    "guest_name": "John Smith",
    "guest_email": "john@example.com",
    "total_amount": 980.00,
    "nights": 6,
    "status": "confirmed"
}
Optional Data¶
{
    "guest_phone": "+44712345678",
    "guest_composition": {
        "adults": 2,
        "children": 1,
        "babies": 0,
        "children_ages": [5]  # If available
    },
    "guest_notes": "Arriving around 4pm",
    "breakdown": {
        "nightly_rate": 150.00,
        "cleaning_fee": 80.00,
        "airbnb_fee": 147.00,
        "taxes": 49.00
    },
    "currency": "EUR",
    "source_language": "en"
}
Parsing Algorithms¶
1. Confirmation Code Extraction¶
import re

def extract_confirmation_code(text):
    # Pattern: 10 uppercase alphanumeric characters.
    # The first such token in the text is returned.
    pattern = r'\b[A-Z0-9]{10}\b'
    match = re.search(pattern, text)
    return match.group(0) if match else None
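Because the pattern accepts any 10-character uppercase token, an ordinary capitalized word of that length can be picked up before the real code. The sketch below lists every candidate and then applies one possible filter. The helper names are hypothetical, and requiring at least one digit is an assumption that may not hold for every Airbnb code.

```python
import re

CODE_PATTERN = re.compile(r'\b[A-Z0-9]{10}\b')

def find_code_candidates(text):
    """Return every 10-character uppercase alphanumeric token, in order."""
    return CODE_PATTERN.findall(text)

def pick_code(text):
    """Prefer the first candidate containing a digit (assumed heuristic)."""
    for candidate in find_code_candidates(text):
        if any(ch.isdigit() for ch in candidate):
            return candidate
    return None

sample = "RESERVA CONFIRMADA. Code: HM2YEYHZXC"
print(find_code_candidates(sample))  # ['CONFIRMADA', 'HM2YEYHZXC']
print(pick_code(sample))             # HM2YEYHZXC
```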
2. Date Extraction¶
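import re

def extract_dates(text, language='en'):
    date_patterns = {
        'fr': r'(\d{1,2}\s+\w+\s+\d{4})',
        'en': r'(\w+\s+\d{1,2},\s+\d{4})',
        'pt': r'(\d{1,2}\s+de\s+\w+\s+de\s+\d{4})'
    }
    # Extraction and conversion logic is not detailed here.
    # As a placeholder, return the raw date strings that match.
    pattern = date_patterns[language]
    return re.findall(pattern, text)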
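The conversion from a matched date string to an ISO date is not shown on this page. One way it could look, as a sketch under assumptions: the month tables, the named-group patterns and the function name `to_iso_dates` are illustrative, not AirDoo's implementation.

```python
import re
from datetime import date

# Month names per language, in calendar order (index + 1 = month number).
MONTHS = {
    'en': ['january', 'february', 'march', 'april', 'may', 'june', 'july',
           'august', 'september', 'october', 'november', 'december'],
    'fr': ['janvier', 'février', 'mars', 'avril', 'mai', 'juin', 'juillet',
           'août', 'septembre', 'octobre', 'novembre', 'décembre'],
    'pt': ['janeiro', 'fevereiro', 'março', 'abril', 'maio', 'junho', 'julho',
           'agosto', 'setembro', 'outubro', 'novembro', 'dezembro'],
}

# Same shapes as the patterns above, with named groups for day/month/year.
PATTERNS = {
    'fr': r'(?P<day>\d{1,2})\s+(?P<month>\w+)\s+(?P<year>\d{4})',
    'en': r'(?P<month>\w+)\s+(?P<day>\d{1,2}),\s+(?P<year>\d{4})',
    'pt': r'(?P<day>\d{1,2})\s+de\s+(?P<month>\w+)\s+de\s+(?P<year>\d{4})',
}

def to_iso_dates(text, language='en'):
    """Return every recognized date in the text as a YYYY-MM-DD string."""
    results = []
    for match in re.finditer(PATTERNS[language], text):
        name = match.group('month').lower()
        if name not in MONTHS[language]:
            continue  # a word that only looked like a month
        month = MONTHS[language].index(name) + 1
        parsed = date(int(match.group('year')), month, int(match.group('day')))
        results.append(parsed.isoformat())
    return results

print(to_iso_dates("Check-in February 1, 2026, checkout February 7, 2026"))
# ['2026-02-01', '2026-02-07']
```

Looking month names up in a table avoids depending on the system locale, which `datetime.strptime` with `%B` would do.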
3. Price Extraction¶
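import re

def extract_price(text, currency='EUR'):
    # Supports different formats: 980,00€, €980.00, 980 EUR
    unit = '(?:' + re.escape(currency) + ('|€' if currency == 'EUR' else '') + ')'
    number = r'(\d+(?:[.,]\d+)?)'  # decimals optional; thousands separators not handled
    patterns = [number + r'\s*' + unit, unit + r'\s*' + number]
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return float(match.group(1).replace(',', '.'))
    return None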
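Amount strings such as `980,00` and `980.00` use different decimal separators, and larger amounts may also carry a thousands separator. A minimal normalization sketch follows. The function name `normalize_amount` and the rule "the separator that appears last is the decimal one" are assumptions for illustration.

```python
from decimal import Decimal

def normalize_amount(raw: str) -> Decimal:
    """Convert '1.234,56', '1,234.56', '980,00' or '980' to a Decimal."""
    cleaned = raw.strip().replace(' ', '')
    last_dot, last_comma = cleaned.rfind('.'), cleaned.rfind(',')
    if last_dot != -1 and last_comma != -1:
        # Both present: the one that appears last is the decimal separator.
        if last_comma > last_dot:
            cleaned = cleaned.replace('.', '').replace(',', '.')
        else:
            cleaned = cleaned.replace(',', '')
    elif last_comma != -1:
        # Only a comma: treated as the decimal separator (assumption).
        cleaned = cleaned.replace(',', '.')
    return Decimal(cleaned)

print(normalize_amount('1.234,56'))  # 1234.56
print(normalize_amount('980,00'))    # 980.00
```

`Decimal` is used instead of `float` so that monetary values keep their exact cents.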
Data Validation¶
Validation Rules¶
VALIDATION_RULES = {
    'confirmation_code': {
        'required': True,
        'pattern': r'^[A-Z0-9]{10}$',
        'message': 'Invalid confirmation code'
    },
    'checkin_date': {
        'required': True,
        'type': 'date',
        'future': True,
        'message': 'Invalid check-in date'
    },
    'total_amount': {
        'required': True,
        'type': 'float',
        'min': 0.01,
        'message': 'Invalid total amount'
    },
    'guest_email': {
        'required': True,
        'type': 'email',
        'message': 'Invalid email'
    }
}
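A rules dictionary of this shape can be applied by a small generic function. The sketch below repeats a subset of the rules so that it runs on its own. The `validate` function, its `today` parameter and the reading of `'future': True` as "strictly after today" are assumptions, not the module's documented behavior.

```python
import re
from datetime import date

# A subset of the rules above, repeated so the sketch is self-contained.
RULES = {
    'confirmation_code': {'required': True, 'pattern': r'^[A-Z0-9]{10}$',
                          'message': 'Invalid confirmation code'},
    'checkin_date': {'required': True, 'type': 'date', 'future': True,
                     'message': 'Invalid check-in date'},
    'total_amount': {'required': True, 'type': 'float', 'min': 0.01,
                     'message': 'Invalid total amount'},
}

def validate(data, rules=RULES, today=None):
    """Return the message of every rule that fails, in rule order."""
    today = today or date.today()
    errors = []
    for field, rule in rules.items():
        value = data.get(field)
        if value is None:
            if rule.get('required'):
                errors.append(rule['message'])
            continue
        ok = True
        if 'pattern' in rule:
            ok = re.match(rule['pattern'], str(value)) is not None
        if rule.get('type') == 'date':
            try:
                parsed = date.fromisoformat(value)
                ok = ok and (not rule.get('future') or parsed > today)
            except (TypeError, ValueError):
                ok = False
        if rule.get('type') == 'float':
            ok = (ok and isinstance(value, (int, float))
                  and value >= rule.get('min', float('-inf')))
        if not ok:
            errors.append(rule['message'])
    return errors
```

Passing `today` explicitly makes the "future" check reproducible in tests.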
Validation Process¶
- Syntactic validation: Format and type
- Semantic validation: Data consistency
- Business validation: AirDoo-specific rules
- Consistency validation: Dates, prices, etc.
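As an illustration of the consistency step, the check below compares the two dates and the `nights` field of the required data. The function name `check_consistency` and the message texts are hypothetical.

```python
from datetime import date

def check_consistency(data):
    """Return a list of inconsistencies found in parsed booking data."""
    problems = []
    checkin = date.fromisoformat(data['checkin_date'])
    checkout = date.fromisoformat(data['checkout_date'])
    if checkout <= checkin:
        problems.append('checkout_date must be after checkin_date')
    elif 'nights' in data and data['nights'] != (checkout - checkin).days:
        problems.append('nights does not match the date range')
    return problems

booking = {'checkin_date': '2026-02-01', 'checkout_date': '2026-02-07', 'nights': 6}
print(check_consistency(booking))  # []
```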
Error Handling¶
Error Types¶
- Format errors: Unrecognized email
- Data errors: Missing or invalid data
- Consistency errors: Internal inconsistencies
- System errors: Technical issues
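These four categories map naturally onto a small exception hierarchy. The class names and the `classify` helper below are assumptions. The keys `format`, `data`, `consistency` and `system` are the ones used by the error log model on this page.

```python
class ParsingError(Exception):
    """Base class; error_type holds one of the four category keys."""
    error_type = 'system'

class FormatError(ParsingError):
    """The email was not recognized."""
    error_type = 'format'

class DataError(ParsingError):
    """A field is missing or invalid."""
    error_type = 'data'

class ConsistencyError(ParsingError):
    """Fields contradict each other."""
    error_type = 'consistency'

def classify(exc):
    """Map any exception to a category key; unknown ones count as system errors."""
    return exc.error_type if isinstance(exc, ParsingError) else 'system'
```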
Error Logging¶
from odoo import fields, models

class ParsingErrorLog(models.Model):
    _name = 'airdoo.parsing_error'
    _description = 'AirDoo Parsing Error'

    email_id = fields.Char('Email ID')
    error_type = fields.Selection([
        ('format', 'Unsupported Format'),
        ('data', 'Invalid Data'),
        ('consistency', 'Inconsistency'),
        ('system', 'System Error')
    ])
    error_message = fields.Text('Error Message')
    raw_content = fields.Text('Raw Content')
    parsed_data = fields.Text('Parsed Data')
    resolution_status = fields.Selection([
        ('pending', 'Pending'),
        ('resolved', 'Resolved'),
        ('ignored', 'Ignored')
    ])
Parsing Customization¶
Custom Parsing Rules¶
from odoo import fields, models

class CustomParsingRule(models.Model):
    _name = 'airdoo.parsing_rule'
    _description = 'AirDoo Custom Parsing Rule'

    name = fields.Char('Rule Name')
    pattern = fields.Text('Regex Pattern')
    field_to_extract = fields.Char('Field to Extract')
    transformation = fields.Text('Transformation')
    priority = fields.Integer('Priority')
    # Without a default, new rules would be created inactive.
    active = fields.Boolean('Active', default=True)
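How such rules might be applied to an email body can be sketched without Odoo. Here `Rule` is a plain dataclass standing in for an `airdoo.parsing_rule` record, `apply_rules` is a hypothetical helper, and running lower `priority` values first is an assumption. The `transformation` field is left out because its format is not described here.

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    """Plain-Python stand-in for an airdoo.parsing_rule record."""
    name: str
    pattern: str
    field_to_extract: str
    priority: int = 10
    active: bool = True

def apply_rules(text, rules):
    """Apply active rules in priority order; the first match per field wins."""
    extracted = {}
    for rule in sorted(rules, key=lambda r: r.priority):
        if not rule.active or rule.field_to_extract in extracted:
            continue
        match = re.search(rule.pattern, text)
        if match:
            # Use the first capture group when the pattern defines one.
            value = match.group(1) if match.groups() else match.group(0)
            extracted[rule.field_to_extract] = value
    return extracted

rules = [
    Rule('Guest count (generic)', r'(\d+)\s+guests', 'guest_count', priority=20),
    Rule('Guest count (labelled)', r'Guests:\s*(\d+)', 'guest_count', priority=5),
    Rule('Disabled rule', r'(.+)', 'guest_notes', active=False),
]
print(apply_rules('Guests: 3', rules))  # {'guest_count': '3'}
```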
Performance and Optimization¶
Parsing Cache¶
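from collections import OrderedDict

class ParsingCache:
    def __init__(self):
        # OrderedDict keeps entries ordered from least to most recently used
        self.cache = OrderedDict()
        self.max_size = 1000

    def get(self, email_hash):
        if email_hash in self.cache:
            # A hit makes the entry the most recently used
            self.cache.move_to_end(email_hash)
        return self.cache.get(email_hash)

    def set(self, email_hash, parsed_data):
        if email_hash not in self.cache and len(self.cache) >= self.max_size:
            # LRU eviction: drop the least recently used entry
            self.cache.popitem(last=False)
        self.cache[email_hash] = parsed_data
        self.cache.move_to_end(email_hash)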
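The cache is keyed by an `email_hash`, but how that hash is computed is not specified. A common choice is a SHA-256 digest of the raw message, sketched below. The function names are hypothetical, and a plain dictionary stands in for a `ParsingCache` instance.

```python
import hashlib

def email_hash(raw_email: str) -> str:
    """Stable cache key: SHA-256 hex digest of the UTF-8 encoded email."""
    return hashlib.sha256(raw_email.encode('utf-8')).hexdigest()

cache = {}  # stands in for a ParsingCache instance

def parse_with_cache(raw_email, parse):
    """Call parse(raw_email) only if this exact email was not seen before."""
    key = email_hash(raw_email)
    if key not in cache:
        cache[key] = parse(raw_email)
    return cache[key]
```

Hashing the content rather than using a message ID means a re-delivered copy of the same email also hits the cache.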
Unit Tests¶
Test Structure¶
import unittest

# load_fixture and parse_email are not defined on this page; they are
# assumed to come from the module's test helpers and parser.

class TestAirbnbParser(unittest.TestCase):
    def test_english_confirmation_email(self):
        email_content = load_fixture('english_confirmation.eml')
        result = parse_email(email_content, language='en')
        self.assertIsNotNone(result)
        self.assertEqual(result['confirmation_code'], 'HM2YEYHZXC')
        self.assertEqual(result['nights'], 6)
        self.assertEqual(result['total_amount'], 980.00)

    def test_english_modification_email(self):
        email_content = load_fixture('english_modification.eml')
        result = parse_email(email_content, language='en')
        self.assertIsNotNone(result)
        self.assertEqual(result['status'], 'modified')

    def test_invalid_email_format(self):
        email_content = "This is not an Airbnb email"
        result = parse_email(email_content)
        self.assertIsNone(result)
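The fixtures are `.eml` files, so a test helper needs to get from the raw message to its body text at some point. One possible shape for that step, using only the standard library. The name `body_from_eml` is hypothetical, and whether `load_fixture` does this itself or leaves it to `parse_email` is not stated here.

```python
from email import message_from_string, policy

def body_from_eml(eml_text: str) -> str:
    """Return the plain-text body of an .eml message, falling back to HTML."""
    message = message_from_string(eml_text, policy=policy.default)
    body = message.get_body(preferencelist=('plain', 'html'))
    return body.get_content() if body is not None else ''
```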
Maintenance and Evolution¶
Updating Parsers¶
- Monitoring: Success rate by format
- Detection: New unsupported formats
- Adaptation: Update existing rules
- Testing: Validate before deployment
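The monitoring step relies on a success rate per format. A minimal in-memory sketch of that counter follows. The class and method names are assumptions, and a real deployment would presumably persist the counts.

```python
from collections import Counter

class ParsingStats:
    """Tracks parsing attempts and successes per format."""

    def __init__(self):
        self.attempts = Counter()
        self.successes = Counter()

    def record(self, format_name, success):
        self.attempts[format_name] += 1
        if success:
            self.successes[format_name] += 1

    def success_rate(self, format_name):
        """Fraction of successful parses, or None if the format was never seen."""
        attempts = self.attempts[format_name]
        return self.successes[format_name] / attempts if attempts else None
```

A falling rate for one format is a signal that the email layout has changed and the corresponding rules need updating.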
Best Practices¶
1. Robustness¶
- Handle edge cases: Partial emails, hybrid formats
- Strict validation: Reject doubtful data
- Smart fallback: Alternative extraction attempts
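The fallback idea can be expressed as a helper that tries several extractors in order and keeps the first usable result. All names below are hypothetical, and the two date formats are only examples.

```python
from datetime import datetime

def first_successful(extractors, text):
    """Return the first non-None result; an extractor that raises is skipped."""
    for extractor in extractors:
        try:
            value = extractor(text)
        except (ValueError, AttributeError):
            continue
        if value is not None:
            return value
    return None

def parse_iso(text):
    return datetime.strptime(text, '%Y-%m-%d').date()

def parse_slash(text):
    return datetime.strptime(text, '%d/%m/%Y').date()

print(first_successful([parse_iso, parse_slash], '01/02/2026'))  # 2026-02-01
```

Combined with strict validation, a value that no extractor can produce stays `None` instead of being guessed.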
2. Performance¶
- Smart cache: Avoid unnecessary re-parsing
- Lazy parsing: Extract only what is needed
- Regex optimization: Compiled and efficient patterns
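Lazy parsing and compiled patterns can be combined in a small wrapper object. In this sketch the class name `LazyBooking` and its `extractions` counter are illustrative. The amount pattern follows the one used for price extraction, fixed to EUR.

```python
import re
from functools import cached_property

# Compiled once at import time and reused for every email.
AMOUNT_RE = re.compile(r'(\d+[.,]\d+)\s*EUR')

class LazyBooking:
    """Extracts a field only when it is first read, then caches it."""

    def __init__(self, text):
        self.text = text
        self.extractions = 0  # counts real regex runs, for illustration

    @cached_property
    def total_amount(self):
        self.extractions += 1
        match = AMOUNT_RE.search(self.text)
        return float(match.group(1).replace(',', '.')) if match else None
```

Reading `total_amount` twice runs the regex once. A field that is never read costs nothing.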
3. Maintainability¶
- Modular code: Separate parsers by language/format
- Tests: Maximum coverage of cases