
Data processing transforms raw data into meaningful information. Learn the fundamental concepts and workflows.
Data processing converts raw data into usable information through a series of steps, and most data analysis projects follow a similar pipeline.
```python
# Simple data processing workflow
raw_data = [1, 2, 2, 3, 3, 3, 4, 5]
print(f"Raw data: {raw_data}")
print(f"Count: {len(raw_data)}")
print(f"Average: {sum(raw_data) / len(raw_data)}")
print(f"Unique values: {set(raw_data)}")
```

Every data processing task follows these steps:
1. Collection: Gather raw data from sources (files, APIs, databases)
2. Cleaning: Remove errors, handle missing values, fix inconsistencies
3. Transformation: Convert data into useful formats
4. Analysis: Extract patterns and insights
5. Visualization: Present results clearly
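The five steps above can be sketched end to end with only the standard library. This is a minimal illustration, not a fixed recipe: the sample text in `RAW_TEXT`, the function names (`collect`, `clean`, `transform`, `analyze`, `visualize`), and the choice to drop incomplete records are all assumptions made for this example.

```python
import csv
import io

# Hypothetical raw input: inconsistent spacing/casing and one missing score.
RAW_TEXT = """name,score
 alice ,95
BOB,
carol,87
"""


def collect(text):
    """Collection: read rows from CSV text (stands in for a file or API)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Cleaning: normalize names and drop rows with a missing score."""
    cleaned = []
    for row in rows:
        score = row["score"].strip()
        if not score:
            continue  # assumption: skip incomplete records
        cleaned.append({"name": row["name"].strip().title(), "score": score})
    return cleaned


def transform(rows):
    """Transformation: convert score strings to integers."""
    return [{"name": r["name"], "score": int(r["score"])} for r in rows]


def analyze(rows):
    """Analysis: compute the average score."""
    return sum(r["score"] for r in rows) / len(rows)


def visualize(rows):
    """Visualization: build a simple text bar chart, one '#' per 10 points."""
    return [f"{r['name']:<6} {'#' * (r['score'] // 10)}" for r in rows]


records = transform(clean(collect(RAW_TEXT)))
average = analyze(records)
chart = visualize(records)

print(f"Average score: {average}")
for line in chart:
    print(line)
```

Keeping each stage in its own function makes it easier to swap one out, for example reading from a real file in `collect` or changing how `clean` treats missing values.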
```python
# Pipeline example
students = [
    {"name": "Alice", "score": 95},
    {"name": "Bob", "score": None},  # Missing data
    {"name": "Carol", "score": 87},
]

# Clean: handle missing values
# (filling with 0 is simple, but it pulls the average down)
for student in students:
    if student["score"] is None:
        student["score"] = 0

# Transform: extract just the scores
scores = [s["score"] for s in students]

# Analyze
avg_score = sum(scores) / len(scores)
print(f"Average score: {avg_score}")
```

Data comes in different formats that require different processing approaches.
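As a rough illustration of format-specific handling, the sketch below loads the same two records from CSV text and from JSON text using the standard library `csv` and `json` modules. The sample records and variable names are invented for this example. The main difference it shows is that CSV fields always arrive as strings, while JSON preserves numbers.

```python
import csv
import io
import json

# Hypothetical sample data: the same records in two formats.
csv_text = "name,age\nAlice,25\nBob,30\n"
json_text = '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]'

# CSV: every field is read as a string, so numbers need explicit conversion.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
csv_records = [{"name": r["name"], "age": int(r["age"])} for r in csv_rows]

# JSON: numbers, strings, lists, and objects map directly to Python types.
json_records = json.loads(json_text)

print(type(csv_rows[0]["age"]).__name__)      # str
print(type(json_records[0]["age"]).__name__)  # int
print(csv_records == json_records)            # True
```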
```python
import json

# CSV format: plain text table
csv_data = """name,age,city
Alice,25,New York
Bob,30,London
Carol,28,Paris"""

# JSON format: structured data
json_data = '{"users": [{"name": "Alice", "age": 25}]}'
parsed = json.loads(json_data)

# Dictionary format: Python native
dict_data = {
    "Alice": {"age": 25, "city": "New York"},
    "Bob": {"age": 30, "city": "London"},
}

print(f"JSON users: {parsed['users']}")
print(f"First user: {dict_data['Alice']}")
```

```python
# Count occurrences
data = ["apple", "banana", "apple", "cherry", "banana", "apple"]
counts = {}
for item in data:
    counts[item] = counts.get(item, 0) + 1
print(counts)  # {'apple': 3, 'banana': 2, 'cherry': 1}

# Filter data
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
even = [n for n in numbers if n % 2 == 0]
print(even)  # [2, 4, 6, 8, 10]

# Transform data
prices = [10, 20, 30]
with_tax = [p * 1.1 for p in prices]
print(with_tax)  # [11.0, 22.0, 33.0]

# Sort by criteria
items = [{"name": "Apple", "price": 1.5}, {"name": "Banana", "price": 0.5}]
by_price = sorted(items, key=lambda x: x["price"])
print(by_price[0])  # {'name': 'Banana', 'price': 0.5}
```

```python
# Raw sales data
sales = [
    {"product": "Laptop", "amount": 1000, "date": "2024-01-01"},
    {"product": "Mouse", "amount": 25, "date": "2024-01-02"},
    {"product": "Laptop", "amount": 1000, "date": "2024-01-03"},
]

# Step 1: Calculate total by product
totals = {}
for sale in sales:
    product = sale["product"]
    totals[product] = totals.get(product, 0) + sale["amount"]

# Step 2: Find best-selling product
best_product = max(totals, key=totals.get)
print(f"Best seller: {best_product} (${totals[best_product]})")

# Step 3: Calculate average sale
average = sum(s["amount"] for s in sales) / len(sales)
print(f"Average sale: ${average:.2f}")
```

| Tool | Purpose | Use Case |
|---|---|---|
| Pandas | DataFrame manipulation | Tabular data analysis |
| NumPy | Numerical arrays | Fast mathematical operations |
| `csv` module | Read/write CSV files | File I/O |
| `json` module | Handle JSON data | API responses |
| `re` module (regular expressions) | Text parsing | Pattern matching |
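
To make the table more concrete, here is a small sketch that combines three of the standard library tools listed above: `csv` to read rows, `re` to validate a field, and `json` to serialize the result. The order data and the `ORD-` ID pattern are hypothetical. Pandas and NumPy are left out because they are third-party packages that need separate installation.

```python
import csv
import io
import json
import re

# Hypothetical order export; the second row has a malformed ID.
orders_text = "order_id,amount\nORD-001,19.99\nbad id,5.00\nORD-002,5.01\n"

# Assumed ID format for this example: "ORD-" followed by exactly three digits.
id_pattern = re.compile(r"ORD-\d{3}")

valid_orders = []
for row in csv.DictReader(io.StringIO(orders_text)):
    # Keep only rows whose ID matches the whole pattern.
    if id_pattern.fullmatch(row["order_id"]):
        valid_orders.append(
            {"order_id": row["order_id"], "amount": float(row["amount"])}
        )

# Summarize the valid rows and emit the result as a JSON string.
summary = {
    "valid_count": len(valid_orders),
    "total": round(sum(o["amount"] for o in valid_orders), 2),
}
print(json.dumps(summary))
```

Using `fullmatch` rather than `search` means an ID must match the pattern from start to end, so values with extra characters around a valid-looking ID are rejected.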