Research & Development
01
Data Ingestion
The pipeline initiates with an investigative ingestion pass of two complementary data streams. Rather than a simple bulk load, this stage validates the structural integrity of the raw corpora — identifying redundant metadata and non-standard column conventions before establishing a definitive data contract for all downstream processing.
Supervised Corpus
1k_data_emoji_tweets_senti_posneg.csv
- 1,000 labeled social text samples
- Embedded UTF-8 emoji tokens preserved in-place
- Binary sentiment targets (0 = negative, 1 = positive)
Emoji Reference Table
15_emoticon_data.csv
- 16 entries — Unicode codepoints & canonical names
- Structural metadata only — no polarity scores encoded
- Used downstream as an emoji coverage reference
Raw Schema — Tweets
Raw Schema — Emoji Ref
Investigation Output
"Ingestion identified redundant Unnamed: 0 indices
and non-standard post /
sentiment column naming.
These findings served as the direct requirements for the
dataset.py standardisation script
and established the read-only contract: all downstream notebooks consume
from data/processed/ only."
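The findings above imply a small standardisation pass. A minimal sketch, assuming pandas as the tooling and the raw column names reported by the investigation (post, sentiment); the actual dataset.py may differ:

```python
import pandas as pd

def standardise(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the dataset.py standardisation pass."""
    out = raw.drop(columns=["Unnamed: 0"], errors="ignore")           # redundant index
    out = out.rename(columns={"post": "text", "sentiment": "label"})  # non-standard names
    return out.dropna().reset_index(drop=True)

# Stand-in frame mirroring the reported raw schema
raw = pd.DataFrame({"Unnamed: 0": [0, 1],
                    "post": ["love this 😍", "so tired 😭"],
                    "sentiment": [1, 0]})
clean = standardise(raw)
print(list(clean.columns))  # → ['text', 'label']
```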
02
Data Cleaning & Validation
Acting on the findings from the ingestion pass, this stage enforces a deterministic cleaning protocol with hard validation gates. Rather than aggressive text transformation, the focus is structural standardisation and schema assertion — ensuring emojis and raw linguistic cues remain intact and unmodified for all downstream analysis.
Structural Alignment
- Standardised schema: text, label
- Type enforcement: str, int
- Null removal: dropna + reset_index
- Reference table: snake_case emoji, unicode_* columns
Preservation Guardrails
- ✔ Zero-loss emoji retention — no replacement or collapse
- ✔ No semantic transforms — raw text preserved exactly
- ✔ No modeling assumptions introduced at this stage
Hard Validation Assertions
- assert: schema is exactly ["label", "text"]
- assert: all labels are strictly binary — label ∈ {0, 1}
- assert: no empty text fields — len(text) > 0 for all rows
- assert: all 16 emoji reference entries are unique — no duplicate emoji characters
Pipeline halts if any assertion fails — this stage is gated, not advisory.
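A sketch of how these gates might be expressed, assuming pandas frames named tweets and emoji_ref and the column names from the contract; the real validation code may be organised differently:

```python
import pandas as pd

def validate(tweets: pd.DataFrame, emoji_ref: pd.DataFrame) -> None:
    """Sketch of the hard validation gate; raises AssertionError to halt the pipeline."""
    assert sorted(tweets.columns) == ["label", "text"], "schema drift"
    assert set(tweets["label"].unique()) <= {0, 1}, "non-binary label"
    assert (tweets["text"].str.len() > 0).all(), "empty text field"
    assert emoji_ref["emoji"].is_unique, "duplicate emoji reference entry"

# Passes silently on conforming inputs; any violation halts the run
validate(pd.DataFrame({"text": ["hi 😊"], "label": [1]}),
         pd.DataFrame({"emoji": ["😊", "😭"]}))
```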
Dataset Contract: Two canonical artifacts are written to
data/processed/ —
tweets_clean.csv (1,000 rows × 2 cols) and
emoji_reference_clean.csv (16 rows × 3 cols).
Both are treated as immutable, read-only inputs for all subsequent
feature engineering and modelling passes. Raw data is never accessed again.
03
Exploratory Analysis
Emoji Presence Rate
49.3% of Corpus
Exploratory analysis confirms that nearly half of all tweets contain at least one emoji — making emoji usage common, not incidental. This frequency alone justified proceeding with emoji-aware feature engineering rather than treating symbols as noise.
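The presence rate is a one-line aggregate once emoji detection is defined. A sketch using a Unicode-category heuristic (category "So", Symbol-other, which catches the common emoji ranges); the notebook's actual detector may differ:

```python
import unicodedata

def has_emoji(text: str) -> bool:
    # Heuristic: any character in Unicode category "So" (Symbol, other)
    return any(unicodedata.category(c) == "So" for c in text)

tweets = ["love this 😍", "plain text", "so sad 😭"]
rate = sum(has_emoji(t) for t in tweets) / len(tweets)
print(f"{rate:.1%}")  # → 66.7%
```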
Label Distribution
Subtle Positive Lean
The corpus shows a natural skew toward positive sentiment (500 pos vs 436 neg). While stable, this 14.7% baseline lean informs the eventual need for deterministic guardrails.
Emoji Distribution
Bimodal & Sparse
~505 tweets carry zero emojis; ~465 carry exactly one. Fewer than 5% contain two or more — emojis act as binary anchors, not intensity counters.
Label Correlation
Positive Bias Anchor
Positive tweets use emojis at a significantly higher rate. As affective amplifiers, these symbols frequently outweigh textual weight, justifying a hybrid logic layer to mitigate positive drift.
Coverage Gap — The Critical Finding
- 34 unique emojis in the corpus
- 16 emojis in the reference table
- 12 in the intersection (overlap)
The reference table covers only 12 of the 34 unique emojis found in the corpus. This gap — confirmed here empirically — became the direct motivation for building an expanded polarity lexicon in the feature engineering stage.
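The gap reduces to a set intersection. A sketch with small stand-in sets (the real comparison is 34 corpus emojis against 16 reference entries):

```python
corpus_emojis = {"😍", "😊", "😭", "😧", "🔥"}   # stand-in for the 34 corpus emojis
ref_emojis = {"😍", "😊", "😭", "👍"}            # stand-in for the 16 reference entries

overlap = corpus_emojis & ref_emojis
coverage = len(overlap) / len(corpus_emojis)
print(len(overlap), f"{coverage:.0%}")  # → 3 60%
```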
Design Decision
"EDA confirmed emoji-aware feature engineering is justified — but under strict constraints: simple, interpretable features only; no dense emoji representations. The subtle 14.7% positive class skew validates the use of a curated polarity lexicon paired with a deterministic logic layer to ensure reliable inference."
04
Emoji Sentiment Referencing
Before committing to any emoji feature design, the pipeline runs a controlled ablation — establishing a text-only baseline, then testing three emoji feature types incrementally. Only features that demonstrably improve on the baseline are retained.
Feature Ablation — Validation F1
Text Only (Baseline)
TF-IDF bigrams, 1,265 features
0.796
Reference
+ Emoji Presence
Binary has_emoji flag
0.775
−0.021 ✕ Rejected
+ Emoji Count
Raw emoji frequency per tweet
0.785
−0.011 ✕ Rejected
+ Emoji Polarity Counts
Separate pos / neg counts via curated lexicon
0.8235
+0.027 ✓ Accepted
from scipy.sparse import hstack

# Manually defined polarity sets (label-agnostic)
POSITIVE_EMOJIS = {"😍", "😊", "😁", "😘"}
NEGATIVE_EMOJIS = {"😭", "😧"}

# Count matches per polarity class for a single tweet
pos = sum(1 for c in text if c in POSITIVE_EMOJIS)
neg = sum(1 for c in text if c in NEGATIVE_EMOJIS)

# Stack the two counts onto the TF-IDF row vector → final feature matrix
X_combined = hstack([X_tfidf, [[pos, neg]]])
↑ 1,265 dims + 2 dims = 1,267
Key Finding
"Feature semantics matter more than feature inclusion. Undirected emoji signals — presence or raw count — introduce more noise than signal. Only when direction is resolved (positive vs. negative) do emoji features improve on the text-only baseline. This distinction drives every decision downstream."
05
Feature Engineering
With the polarity mechanism validated in Step 4, this stage asks how far it
can be pushed. Four independent signal sources are engineered, each justified
by corpus evidence, and stacked into a single production feature matrix
that features.py implements exactly.
Expanded Emoji Polarity
Lexicon expanded from 6 to 193 emojis — closing the 15.2% corpus
coverage gap identified in EDA. Produces
emoji_pos_count and
emoji_neg_count.
0.834 F1
+0.038 vs baseline
Emoticon Fallback
Western-style emoticons such as :) and :( are invisible to character-level emoji extraction. Substring matching closes this structural gap.
0.838 F1
+0.042 vs baseline
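A sketch of the substring matcher, with an assumed, abbreviated emoticon lexicon (the real feature uses a fuller list):

```python
# Assumed, abbreviated emoticon lexicon
POS_EMOTICONS = (":)", ":-)", ":D", "=)")
NEG_EMOTICONS = (":(", ":-(", ":'(", "=(")

def emoticon_counts(text: str) -> tuple[int, int]:
    # Substring matching catches multi-character emoticons that
    # per-character Unicode extraction cannot see
    pos = sum(text.count(e) for e in POS_EMOTICONS)
    neg = sum(text.count(e) for e in NEG_EMOTICONS)
    return pos, neg

print(emoticon_counts("great job :) :)"))  # → (2, 0)
```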
Word Lexicon
TF-IDF may underweight low-frequency words. A deterministic lexicon
injects stable priors as word_pos_count
and word_neg_count.
0.840 F1
+0.044 vs baseline
Deterministic Sarcasm Veto
Final hybrid layer. If emoji_neg_count ≥ 1 and the model prediction is Positive, the veto forces a Negative override to mitigate positive drift.
X_final = hstack([
    X_tfidf,       # 1,269 dims
    emoji_pol,     # 2 dims (×BOOST)
    emoticon_pol,  # 2 dims (×BOOST)
    word_lex,      # 2 dims
])
# Sarcasm veto applied after prediction: a deterministic hybrid logic
# layer, not a learned feature
06
Model Training & Evaluation
The production classifier is Logistic Regression — selected for interpretability and stability on a 1,000-sample dataset. Model selection was closed in NB 3.0; this stage trains on the finalised feature matrix, gates on a hard performance benchmark, and serialises the artifacts.
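The training step reduces to a few lines of scikit-learn. A sketch on synthetic stand-in data; the real run fits the hybrid matrix from Step 5, checks F1 against the benchmark, and only then serialises:

```python
from joblib import dump
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 1,000-sample hybrid feature matrix
X, y = make_classification(n_samples=1000, n_features=50, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
f1 = f1_score(y_val, clf.predict(X_val))

BENCHMARK = 0.80  # assumed gate value
if f1 >= BENCHMARK:
    dump(clf, "sentiment_model.pkl")  # serialise only after the gate passes
```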
Validation Performance — 200 Held-Out Samples
0.8402
F1 Score ✓ Benchmark Met
82.5%
Accuracy
Negative Class
Positive Class
27 false positives vs 8 false negatives — the model is generous with positive
predictions. The sarcasm veto in predict.py
directly targets this failure mode.
Learned Feature Weights — Surprising Result
- word_neg_count — word lexicon, negative hits
- word_pos_count — word lexicon, positive hits
- emoji_pos_count — emoji polarity, positive
- emoji_neg_count — emoji polarity, negative
Architectural Implication
"The word lexicon is the single strongest feature in the model at −4.08, outweighing every TF-IDF unigram. Meanwhile, emoji coefficients are surprisingly weak (+0.15, −0.05) — the model treats word signal as more reliable than emoji signal. This is precisely why the sarcasm veto must exist as a deterministic rule rather than a learned weight: the model alone cannot resolve emoji-text conflict."
07
Inference Pipeline
The finalised system is encapsulated into a two-layer inference pipeline. Frozen artifacts guarantee that every live input undergoes the exact transformation sequence validated during training — followed by a deterministic veto layer that overrides the model when emoji-text conflict is detected.
Model Layer — Statistical Prediction
Frozen sentiment_model.pkl and
tfidf_vectorizer.pkl reconstruct
the 1,269-feature matrix at inference time. The model returns a probability
estimate and a provisional label.
Veto Layer — Deterministic Override
If emoji_neg_count > 0 and
the model predicted Positive, the veto fires — overriding to Negative.
A word guard suppresses the veto when positive text overwhelms the emoji signal.
This layer exists because the model's emoji coefficients are too weak
(+0.15, −0.05) to resolve conflict reliably on their own.
# 1. Load frozen artifacts
model = load("sentiment_model.pkl")
tfidf = load("tfidf_vectorizer.pkl")

# 2. Assemble the 1,269-feature matrix
X = hstack([tfidf.transform([text]), emoji_pol, word_lex])

# 3. Model prediction
proba = model.predict_proba(X)  # e.g. [[0.07, 0.93]]

# 4. Veto check
if emoji_neg_count > 0 and label == "Positive":
    label = "Negative"  # unless word guard fires

# 5. Return 8-key response
# → { prediction, confidence, veto_applied,
#     entropy_flag, top_drivers, timestamp, … }
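Putting the veto and the word guard together, a sketch of the override step (the guard threshold is assumed here; the real predict.py defines its own):

```python
def apply_veto(label: str, emoji_neg_count: int, word_pos_count: int,
               guard_threshold: int = 3) -> tuple[str, bool]:
    """Deterministic override sketch: the veto fires on emoji-text conflict
    unless strong positive wording suppresses it (threshold assumed)."""
    if (label == "Positive" and emoji_neg_count > 0
            and word_pos_count < guard_threshold):
        return "Negative", True   # veto fires
    return label, False           # model prediction stands

print(apply_veto("Positive", emoji_neg_count=1, word_pos_count=0))  # → ('Negative', True)
print(apply_veto("Positive", emoji_neg_count=1, word_pos_count=5))  # → ('Positive', False)
```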
Veto Verification — End-to-End Validation
"i love having bugs 😭"
Sarcasm — veto fires
conf 0.9318
"this is so beautiful and amazing 😭"
Pos-overwhelm — word guard suppresses veto
conf 0.7478
"RIP grandma 😭"
Genuine sad — veto fires correctly
conf 0.9318
"this is the worst :)"
Known limitation — emoticon sarcasm undetected
conf 0.8479
Validation Status
"End-to-end validation confirms schema integrity across all prediction
categories — 8 keys present in every response, audit trail consistent,
veto logic verified on sarcasm and genuine-sad cases. One known limitation
remains: emoticon-based sarcasm bypasses the veto since
:) is invisible to the Unicode
emoji extraction pipeline."