Credit Card Fraud Detection Using Snowflake-Arctic and Streamlit
In today’s rapidly evolving financial landscape, credit card fraud detection has become a critical focus for institutions to safeguard customer transactions. Leveraging the power of Large Language Models (LLMs) can provide a highly effective approach to detecting fraudulent patterns in real time. In this article, we’ll explore how to build a robust credit card fraud detection system using Snowflake’s Arctic to create the dataset and Mistral-7B, a compact LLM, as the model we fine-tune. We’ll demonstrate how the Snowflake interface and its native notebooks can streamline this process, offering a powerful end-to-end solution for financial security.
Snowflake Arctic
Snowflake Arctic is a powerful Large Language Model (LLM) that is designed to work seamlessly within the Snowflake Data Cloud ecosystem. It offers high scalability and performance to handle large datasets and complex machine learning workloads. Specifically, Arctic is built to support a variety of tasks including Natural Language Processing (NLP), data analytics, and AI-driven insights, making it particularly useful for large-scale enterprise applications.
Key Features of Snowflake Arctic LLM:
- High Parameter Count: With models like Arctic reaching upwards of 480 billion parameters, it can handle a wide variety of complex language tasks, making it ideal for nuanced data analysis such as contract review, sentiment analysis, and fraud detection.
- Seamless Integration with Snowflake: Arctic is integrated natively into Snowflake, allowing users to leverage Snowflake’s secure and scalable data platform without needing to export data or use external tools. This means you can train, fine-tune, and deploy models directly in your Snowflake instance.
- Custom Dataset Handling: Arctic allows users to create custom datasets from their Snowflake tables, which can be used for training and inference. This flexibility is crucial for businesses that need to adapt models to specific data domains, like financial fraud detection.
- Supports Fine-Tuning: Arctic supports fine-tuning of its models, enabling users to tailor the pre-trained LLMs to specific tasks or industries. This is particularly useful for companies looking to develop domain-specific applications such as credit card fraud detection, where accuracy and real-time analysis are critical.
- Efficiency: Although Arctic totals roughly 480B parameters, its mixture-of-experts design activates only a fraction of them per token, keeping inference costs manageable; Snowflake Cortex also offers smaller models that are great for quick inferences or when computational resources are limited.
- Performance Monitoring and Optimization: Snowflake’s infrastructure allows users to monitor the performance of their models in real-time, making it easier to optimize LLMs for different tasks, such as reducing false positives in fraud detection systems.
Snowflake Arctic, with its high adaptability and enterprise-grade security, makes it a powerful tool for AI/ML-driven financial solutions, including fraud detection, allowing teams to harness the best of AI without leaving the Snowflake environment.
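As a small illustration of that native integration, the following hypothetical snippet calls Arctic through the Cortex Python API from inside a Snowflake notebook (the prompt text is purely illustrative):
from snowflake.cortex import Complete
# Minimal sketch: invoke Arctic directly inside Snowflake, no data export needed
answer = Complete("snowflake-arctic", "List three common signals of credit card fraud.")
print(answer)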
At a high level, the workflow is: generate a synthetic transaction dataset with Arctic, load it into a Snowflake table, fine-tune Mistral-7B on that table with Cortex, and expose the tuned model through a Streamlit app.
Let’s start!
# Import python packages
import streamlit as st
import pandas as pd
# We can also use Snowpark for our analyses!
from snowflake.snowpark.context import get_active_session
session = get_active_session()
This initial code snippet sets up the environment for the application. It uses the `streamlit` package to build an interactive UI and `pandas` for handling data operations. The key part is importing Snowflake’s Snowpark functionality to establish a session with the active Snowflake environment. This session will be used to connect and interact with Snowflake for further data operations, such as loading and manipulating the fraud detection dataset. Snowpark provides efficient ways to run queries and manage data within Snowflake, making it an ideal choice for data processing.
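For instance, a quick sanity-check query confirms the session is live (a minimal sketch; the query itself is arbitrary):
# Sanity check: confirm the session is connected and note the current context
print(session.sql("SELECT CURRENT_WAREHOUSE(), CURRENT_DATABASE()").collect())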
PROMPT = """
Generate 100 examples of credit card transactions in JSON format for fraud detection, where each example has the following structure:
- A `details` object containing the following keys:
- `Transaction_ID`: A unique identifier for the transaction.
- `Transaction_Amount`: The amount of the transaction.
- `Transaction_DateTime`: The date and time of the transaction in the format "YYYY-MM-DD HH:MM:SS".
- `Customer_ID`: A unique identifier for the customer.
- `Card_Type`: The type of card used for the transaction (e.g., "Credit" or "Debit").
- `Merchant_Category`: The merchant category of the transaction (e.g., "Groceries", "Electronics", "Travel").
- `Transaction_Location`: The location of the transaction (e.g., "USA", "Canada", "France").
- `Transaction_Channel`: The channel used for the transaction (e.g., "Online", "In-Store").
- `Device_Type`: The device used to make the transaction (e.g., "Mobile", "Desktop").
- `Previous_Transaction_Amount`: The amount of the most recent previous transaction by the customer.
- `Is_International_Transaction`: A boolean indicating whether the transaction was international.
- `Time_Since_Last_Transaction`: The time in minutes since the customer's last transaction.
- `Account_Age_Days`: The number of days since the customer's account was created.
- `Fraud_History`: A boolean indicating whether the customer has a history of fraud.
- `Transaction_Count_24H`: The number of transactions made by the customer in the last 24 hours.
- `Transaction_Velocity`: The speed at which transactions were made (amount/time).
- `Average_Spend_Per_Transaction`: The average spending by the customer in their previous transactions.
- `Suspicious_Merchant`: A boolean indicating whether the merchant is flagged for suspicious activity.
- `Card_Present`: A boolean indicating whether the card was physically present at the time of the transaction.
- A `label` indicating whether the transaction was fraudulent or not (`"Fraud"` or `"Not Fraud"`).
- A `description` briefly explaining why the transaction was flagged as fraudulent or not, considering the various details.
Each transaction should be realistic and reflect different types of behavior, including fraud attempts, normal activity, and edge cases. Generate 100 of these examples.
Example:
{
"details": {
"Transaction_ID": "T123456",
"Transaction_Amount": 3000.00,
"Transaction_DateTime": "2024-10-15 12:45:00",
"Customer_ID": "CUST12345",
"Card_Type": "Credit",
"Merchant_Category": "Luxury Goods",
"Transaction_Location": "France",
"Transaction_Channel": "Online",
"Device_Type": "Mobile",
"Previous_Transaction_Amount": 50.00,
"Is_International_Transaction": true,
"Time_Since_Last_Transaction": 5,
"Account_Age_Days": 200,
"Fraud_History": false,
"Transaction_Count_24H": 7,
"Transaction_Velocity": 428.57,
"Average_Spend_Per_Transaction": 100.00,
"Suspicious_Merchant": false,
"Card_Present": false
},
"label": "Fraud",
"description": "This transaction was flagged as fraud due to a large international purchase in France, which is unusual for the customer. The amount is significantly higher than previous transactions, and the card was not physically present."
}
Return the output in the form of Python list. Make sure you don't miss even a single key in the output. Return only the Python list, nothing more. Don't forget to complete the list. Make sure there are 100 records. Make sure the json structure is accurate.
"""
from snowflake.cortex import Complete
output = Complete("snowflake-arctic", PROMPT)
A `PROMPT` is defined to generate synthetic credit card transaction data in JSON format, specifically for fraud detection purposes. The structure includes transaction details and a label indicating whether the transaction was fraudulent or not, along with a description explaining the classification.
Key elements:
- PROMPT: This prompt instructs Snowflake’s Arctic LLM to generate 100 realistic examples of credit card transactions, ensuring each example has specific fields relevant to fraud detection.
- Complete Function: The `Complete` function from the `snowflake.cortex` module is invoked to send the prompt to Snowflake's Arctic LLM. It uses the `PROMPT` to generate a list of credit card transaction records, complete with all the specified details.
This code automates dataset generation, using Arctic’s LLM to simulate realistic transaction data for use in training a fraud detection model. It significantly reduces the manual effort required to create datasets with complex fields.
import json

# Complete returns the model's response as a string; parse it into a Python list
# (assumes the model returned valid JSON — in practice you may need to strip stray text first)
output = json.loads(output)

details = []
label = []
description = []
count = 0
for out in output:
    count += 1
    details.append(out['details'])
    label.append(out['label'])
    description.append(out['description'])
    print(count)
We process the generated output from Snowflake’s Arctic LLM and organize it into three separate lists: `details`, `label`, and `description`.
- Details: Extracts and stores the transaction details from each generated record.
- Label: Stores the fraud detection label (`"Fraud"` or `"Not Fraud"`) for each transaction.
- Description: Collects the description explaining why a transaction was flagged as fraud or not.
The loop increments a counter, extracts data from each `output` entry, and appends the relevant sections to their respective lists. Additionally, the count is printed after each iteration to track progress. This helps in organizing the dataset into usable components for further analysis.
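Since LLM output can occasionally drop fields, a defensive variant of this step might validate each record before accepting it. A minimal sketch, assuming `output` has already been parsed into a list of dicts (the `REQUIRED_KEYS` set is illustrative, not part of the original code):
# Hypothetical validation pass: keep only records with every expected top-level field
REQUIRED_KEYS = {"details", "label", "description"}
valid_records = [rec for rec in output if REQUIRED_KEYS.issubset(rec)]
print(f"Kept {len(valid_records)} of {len(output)} generated records")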
import pandas as pd
# Create a dictionary
data = {
'details': details,
'label': label,
'description': description
}
# Convert dictionary into DataFrame
df = pd.DataFrame(data)
# Display the DataFrame
print(df.head(5))
# Save to CSV; the DataFrame index is written as the first column and will map to the table's id column
df.to_csv("creditcardfrauddata.csv")
We organize the processed data into a structured format using a Pandas DataFrame. Here’s what it does:
- Dictionary Creation: A dictionary named `data` is created with keys `details`, `label`, and `description`, where each key corresponds to the respective list populated in the previous snippet.
- DataFrame Conversion: The dictionary is then converted into a pandas DataFrame. This is useful for efficiently managing and analyzing the generated data.
- Display the DataFrame: The first 5 rows of the DataFrame are printed using `df.head(5)`, providing a preview of the dataset.
This step organizes the synthetic transaction data in a tabular format, making it easier to work with, analyze, or perform operations like filtering or aggregating.
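For example, a one-line aggregation reveals the class balance of the synthetic dataset, which matters when training a classifier:
# Quick class-balance check: how many "Fraud" vs "Not Fraud" examples were generated
print(df['label'].value_counts())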
session.sql("CREATE OR REPLACE STAGE my_stage").collect()
# Put the CSV file into the stage (PUT auto-compresses to gzip by default, hence the .gz suffix later)
session.file.put('creditcardfrauddata.csv', '@my_stage')
# List files in the stage to confirm upload
print("Files in the stage:")
files = session.sql("LIST @my_stage").collect()
for file in files:
print(file)
create_table_sql = """
CREATE OR REPLACE TABLE credit_card_data (
    id STRING,
    details STRING,
    label STRING,
    description STRING
);
"""
session.sql(create_table_sql).collect()
copy_into_sql = """
COPY INTO credit_card_data
FROM @my_stage/creditcardfrauddata.csv.gz
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
ON_ERROR = 'CONTINUE';
"""
# Execute the COPY INTO command
try:
session.sql(copy_into_sql).collect()
print("Data loaded into 'credit_card_data' table successfully!")
except Exception as e:
print("Error loading data into 'credit_card_data':")
print(f"Error Message: {e}")
We manage the upload and loading of a CSV file containing credit card fraud data into a Snowflake database. Here’s a breakdown of its functionality:
- Create Stage: It creates a Snowflake stage named `my_stage` to hold files temporarily during the data loading process.
- Upload CSV File: The `session.file.put` command uploads the `creditcardfrauddata.csv` file into the previously created stage, making it accessible for loading into a table.
- List Files in the Stage: After the upload, the script lists all files in the stage to confirm that the upload was successful.
- Create Table: A SQL command is defined to create or replace a table named `credit_card_data` with columns for `id`, `details`, `label`, and `description`. This table structure is designed to accommodate the data format being uploaded.
- Copy Data into Table: The `COPY INTO` command loads data from the uploaded CSV file in the stage into the `credit_card_data` table. It specifies that the file format is CSV and handles errors by continuing on error.
- Execution and Error Handling: The code executes the `COPY INTO` command within a try-except block to handle potential errors gracefully. If successful, it prints a confirmation message; otherwise, it outputs the error message.
This snippet effectively streamlines the data loading process into Snowflake, facilitating easy access to the generated fraud detection dataset for further analysis or modelling.
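As a final sanity check, you can query the table directly (a minimal sketch):
# Confirm the load: row count plus a peek at a few records
print(session.sql("SELECT COUNT(*) FROM credit_card_data").collect())
print(session.sql("SELECT * FROM credit_card_data LIMIT 3").collect())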
Let’s now start with the finetuning engine.
SELECT SNOWFLAKE.CORTEX.FINETUNE(
'CREATE',
'my_tuned_model',
'mistral-7b',
'SELECT details AS prompt, label AS completion FROM credit_card_data'
);
This SQL command fine-tunes a model in Snowflake Cortex using the data from the `credit_card_data` table. Here’s a concise explanation of each part:
- Fine-tune Function: The command calls `SNOWFLAKE.CORTEX.FINETUNE`, a Snowflake Cortex function designed for model fine-tuning.
- Operation: The first argument, `'CREATE'`, indicates that you want to create a new fine-tuned model.
- Model Name: The second argument, `'my_tuned_model'`, specifies the name for the new fine-tuned model.
- Base Model: The third argument, `'mistral-7b'`, indicates that you are fine-tuning the Mistral-7B base model.
- Data Source: The fourth argument contains a SQL query that selects data from the `credit_card_data` table. It defines the structure for fine-tuning, where `details` is used as the prompt and `label` serves as the completion (the expected output).
This command effectively sets up a fine-tuning process for the Mistral-7B model using the credit card fraud dataset, enabling it to better understand the specific context of fraud detection based on the provided examples.
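Fine-tuning runs as an asynchronous job, so it is worth checking its status before calling the tuned model. A sketch using the same `FINETUNE` function with its `'SHOW'` and `'DESCRIBE'` operations (the job ID below is a placeholder you would take from the `'SHOW'` output):
# List fine-tuning jobs, then inspect one; '<job_id>' is a placeholder
print(session.sql("SELECT SNOWFLAKE.CORTEX.FINETUNE('SHOW')").collect())
print(session.sql("SELECT SNOWFLAKE.CORTEX.FINETUNE('DESCRIBE', '<job_id>')").collect())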
Streamlit Application
from snowflake.cortex import Complete
import streamlit as st

st.title("Fraud Detection")
question = st.text_input("Enter question", placeholder="Enter the details for transactions", label_visibility="collapsed")

if question:
    # Send the user's transaction details to the fine-tuned model and show its verdict
    st.markdown(Complete("my_tuned_model", question))
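This small app ties everything together: it renders a text input, and when the user pastes transaction details, the `Complete` call routes them to the fine-tuned `my_tuned_model` and displays the model’s fraud verdict. Deployed as a Streamlit app inside Snowflake, the whole flow runs within the platform, so no transaction data ever leaves it.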
Conclusion
In this article, we demonstrated how to build a credit card fraud detection system using Snowflake’s Arctic LLM for dataset creation and the Mistral-7B model for fine-tuning. By leveraging the powerful features of Snowflake, we efficiently generated synthetic transaction data and structured it for analysis. The integration of these tools not only streamlines the dataset preparation process but also enhances the model’s ability to identify fraudulent activities. Ultimately, this approach lays a strong foundation for developing robust fraud detection systems in the financial sector.